Hardware/Software Co-Design

       Lecture MPSoC 1
5. Multiprocessor Architectures
• 5.1 Introduction
  – The focus is the study of embedded multiprocessors
  – Multiprocessing (MP) is very common in embedded
    computing because
     • Allows us to meet our performance, cost and energy/power
       consumption goals
  – Embedded MP are often heterogeneous
    multi-processors
     • Made of several types of processors
     • They run sophisticated SW that must be carefully
       designed to obtain the most out of the multi-
       processor
5. Multiprocessor Architectures
• 5.1 Introduction
  – A multiprocessor is made of multiple processing
    elements (PEs)


      Processing      Processing      Processing
       Element          Element         Element
          |                |                |
     +------------------------------------------+      Generic
     |           Interconnection network        |      Multiprocessor
     +------------------------------------------+      (MP)
          |                |                |
        Memory           Memory           Memory
5. Multiprocessor Architectures
• 5.1 Introduction
  – An MP consists of 3 major subsystems
  1. Processing elements that operate on data
  2. Memory blocks that hold data values
  3. Interconnection networks between the PEs and
     memory
• In any MP design we have to decide
  – How many PEs to use
  – How much memory and how to divide it up
  – How rich the interconnection between the
    PEs and memory should be
5. Multiprocessor Architectures
• 5.1 Introduction
  – When designing an embedded multiprocessor the
    choices are varied and complex
     • SERVERS typically use symmetric MP built of identical PEs
       and uniform memory
         – This simplifies programming the machine
  – BUT embedded system (ES) designers are willing
    to trade off some programming complexity for
    cost/performance/energy/power gains
     • => some additional design variables
  – We can vary the types of PEs; they do not have
    to be of the same type
     • Different types of CPUs
     • Non-programmable PEs (perform only one function)
5. Multiprocessor Architectures

• 5.1 Introduction
  – We can use memory blocks of different sizes
     • Also we do not have to require that every PE
       access all memory
        – Using private memories that are shared by only a few
          PEs
        – Therefore the memory performance is optimized for the
          units that use it
  – We can use specialized interconnection
    networks that provide only certain
    connections
5. Multiprocessor Architectures
• 5.1 Introduction
   – Embedded MPs
      • Make use of SIMD parallelism techniques
      • But MIMD architectures are the dominant
        mode of parallel machines in Embedded
        Computing
      • They tend to be heterogeneous (varied)
        PEs
      • Scientific MPs tend to be homogeneous
        parallel machines (copies of the same
        type of PEs)
5. Multiprocessor Architectures
• 5.2 Why Embedded Multiprocessors?
• MPs are commonly used for scientific and
  business servers, so why do we need them in
  embedded computing?
   – Because many of them actually have to
     support huge amounts of computation
   – The best way to meet those demands is to
     use MPs
     • This is particularly true when we must meet real-
       time constraints while also limiting power
       consumption
5. Multiprocessor Architectures
• 5.2 Why Embedded Multiprocessors?
• Embedded MPs face more constraints than
  scientific processors do
  – Both intend to deliver high performance but
    Embedded Systems must do something in addition
     • They must provide real-time performance that is
       predictable
     • They often run at low energy and power levels
     • They have to be cost effective (i.e. provide high
       performance without using excessive amounts of
       HW)
5. Multiprocessor Architectures
• 5.2 Why Embedded Multiprocessors?
• The rigorous demands of embedded
  computing push us toward several design
  techniques
   – Heterogeneous microprocessors are often
     more energy-efficient and cost-effective
     than symmetric multiprocessors
   – Heterogeneous memory systems improve
     real-time performance
   – NoCs support heterogeneous
     architectures
5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded
  Systems
• Example: Computation in Cellular
  Telephones

  – A cellular telephone must perform a
    variety of functions that are basic to
    telephony
     • Compute and check error-correction codes
     • Perform voice compression and
       decompression
     • Respond to the protocol that governs
       communication with the cellular network
5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Cellular Telephones
   – Furthermore, modern cell phones must perform a
     variety of other functions that are required by
     regulations or demanded by the marketplace
     • In the US, cell phones must keep track of their position in
       case the user must be located for emergency services
        – A GPS receiver is often used to find the phone’s position
     • Many cell phones play MP3 audio and also use MIDI or
       other methods to play music for ring tones
     • High-end cell phones provide cameras for still pictures
       and video
     • Cell phones may download application code from the network
5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Video Cameras
  – Video compression requires a great deal of computation,
    even for small images
  – Most video compression systems combine 3 basic methods
    to compress video
     • Lossless compression is used to reduce the size of the representation of the video
       data stream
     • Discrete cosine transform (DCT) is used to help quantize the images and reduce the
       size of the video stream by lossy encoding
     • Motion estimation and compensation allow the contents of one frame to be
       described in terms of motion from another frame
5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Video Cameras
  – Most video compression systems combine the 3 basic
    methods to compress video
  – Of these 3, motion estimation is the most computationally
    intensive
     • Even an efficient motion estimation algorithm must perform a 16×16
       correlation at several points in the video frame, and it must be
       done for the entire frame
     • For a QCIF frame, which is commonly used in cell phones, we
       have 176×144 pixels
        – That frame is divided into 11×9 = 99 of these 16×16 macroblocks for motion estimation

     • If we perform correlations for each macroblock
        – We will have to perform 11×9×16×16 = 25,344 pixel comparisons
        – All these calculations must be done on almost every frame, at
          a rate of 15 or 30 frames/second!!!
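
To make the arithmetic above concrete, here is a minimal C++ sketch of the 16×16 sum-of-absolute-differences (SAD) correlation at the heart of block motion estimation. With one candidate position per macroblock of a QCIF frame, the comparison count matches the 25,344 figure above; a real encoder multiplies this by the number of search positions. The single-candidate search and the names are our illustrative simplifications.

```cpp
// Sketch of the block-matching step: one SAD correlation per macroblock of a
// QCIF frame against the co-located block of the reference frame.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

constexpr int W = 176, H = 144;   // QCIF luma resolution
constexpr int MB = 16;            // macroblock size

// Sum of absolute differences of one 16x16 block at (bx, by).
uint32_t sad16x16(const std::vector<uint8_t>& cur,
                  const std::vector<uint8_t>& ref, int bx, int by) {
    uint32_t sum = 0;
    for (int y = 0; y < MB; ++y)
        for (int x = 0; x < MB; ++x) {
            int i = (by + y) * W + (bx + x);
            sum += std::abs(int(cur[i]) - int(ref[i]));
        }
    return sum;
}

int main() {
    std::vector<uint8_t> cur(W * H, 0), ref(W * H, 0);
    uint64_t comparisons = 0;
    for (int by = 0; by < H; by += MB)        // 9 macroblock rows
        for (int bx = 0; bx < W; bx += MB) {  // 11 macroblock columns
            sad16x16(cur, ref, bx, by);
            comparisons += MB * MB;           // 256 pixel comparisons each
        }
    // 11 x 9 x 16 x 16 = 25,344, as on the slide
    std::printf("pixel comparisons per frame: %llu\n",
                (unsigned long long)comparisons);
}
```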
5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Video Cameras
  – Most video compression systems combine 3 basic
    methods to compress video
  – Of these 3, motion estimation is the most
    computationally intensive
  – The DCT operator is also computationally intensive
      • Even efficient algorithms require a large number of
        multiplications to perform the 8×8 DCT that is commonly used
        in video and image compression
  – For example, [Feig and Winograd] give a DCT algorithm that uses 94
    multiplications and 454 additions to perform an 8×8 2-D DCT
  – This amounts to 94 × 1,584 = 148,896 multiplications per frame for a
    CIF-size (352×288) frame, which contains 1,584 8×8 blocks
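
For contrast with the Feig-Winograd count, the sketch below is a naive separable 8×8 2-D DCT: sixteen 1-D transforms at 64 multiplies each cost 1,024 multiplies per block, which is what fast algorithms like the 94-multiply Feig-Winograd DCT improve on. This is our own illustrative baseline, not the cited algorithm, and it omits the DCT scaling factors.

```cpp
// Naive separable 8x8 2-D DCT: 8 row transforms + 8 column transforms,
// each a 64-multiply 1-D DCT-II, i.e. 1,024 multiplies per block.
#include <cmath>
#include <cstdio>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

constexpr int N = 8;

// Naive 1-D DCT-II (scaling factors omitted): 64 multiply-accumulates.
void dct1d(const double in[N], double out[N]) {
    for (int k = 0; k < N; ++k) {
        double s = 0.0;
        for (int n = 0; n < N; ++n)
            s += in[n] * std::cos(M_PI / N * (n + 0.5) * k);
        out[k] = s;
    }
}

void dct2d(double b[N][N]) {
    double tmp[N][N];
    for (int i = 0; i < N; ++i) dct1d(b[i], tmp[i]);   // 8 row DCTs
    for (int j = 0; j < N; ++j) {                      // 8 column DCTs
        double col[N], out[N];
        for (int i = 0; i < N; ++i) col[i] = tmp[i][j];
        dct1d(col, out);
        for (int i = 0; i < N; ++i) b[i][j] = out[i];
    }
}

int main() {
    double block[N][N] = {};
    dct2d(block);
    // Naive cost per block vs. the fast algorithm cited on the slide:
    std::printf("naive: %d multiplies/block, Feig-Winograd: 94\n",
                2 * N * N * N);
}
```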
5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
  – Many embedded applications need lots of raw processing
    performance
     • But that is not enough: those computations have to be
       performed efficiently
  – [Austin et al. 2004] posed the embedded system
    performance problem as “mobile supercomputing”
     • Today’s PDAs/cell phones already perform a great deal
       of what was once considered to require large
       processors
        –   Speech recognition
        –   Video compression and recognition
        –   High-resolution graphics
        –   High-bandwidth wireless communication
5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
   – [Austin et al.] estimate that a mobile
     supercomputing workload would require
     about 10,000 SPECint of performance
   – That is about 16× the performance provided by a
     2 GHz Intel Pentium 4 processor
   – In the mobile environment, all this
     computation must be performed at very
     low energy
     • Battery capacity is growing at only about 5%/year
5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
   – Given that today’s highest-performance batteries
     have an energy density close to that of TNT
        – We may be close to the amount of energy that people are willing
          to carry with them




5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
  – [Mudge et al.] estimate that for a battery to power the mobile
    supercomputer for 5 days, with the device in use
    20% of the time
     • It must consume no more than 74 mW
     • Unfortunately, general-purpose processors do not meet those
       trends
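
As a back-of-the-envelope check on the 74 mW figure (the ~1.8 Wh battery capacity below is our assumption, typical of small handsets of that era, not a number from the slide): 5 days at 20% use means 24 hours of active operation.

```cpp
// Rough power budget: assumed battery energy divided by active time.
#include <cstdio>

int main() {
    const double battery_Wh = 1.8;         // ASSUMED handset battery capacity
    const double days = 5.0, duty = 0.20;  // 5 days, used 20% of the time
    const double active_h = days * 24.0 * duty;       // = 24 h of active use
    std::printf("budget: %.0f mW\n",
                battery_Wh / active_h * 1000.0);      // ~75 mW, close to 74
}
```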
  – Moore’s law dictates that the number of transistors per
    chip doubles every 18 months => circuits run faster
     • If we could make use of all the potential increase in speed, we
       could meet the 10,000 SPECint performance target
     • But trends show that we are not keeping up: the figure below
       compares the measured performance of commercial processors
       with the predicted trend
     • Traditional optimizations (pipelining, instruction-level
       parallelism) are becoming less effective, even though they
       previously helped designers capture the gains of Moore’s law
Figure: Performance trends for desktop processors
  [Austin et al.], IEEE Computer Society
5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
  – [Mudge et al.] show that power consumption is
    getting worse
  – We need to reduce the energy consumption of the
    processor to use it in a mobile supercomputer!
     • But desktop processors consume more power with
       every new generation
  – Breaking away from these trends requires taking
    advantage of the characteristics of the problem
     • Adding units that are tuned to the core operations that we
       need to perform and
     • Eliminating HW that does not directly contribute to
       performance for the application at hand
  – By designing HW that meets its performance goals
    efficiently, we reduce the system’s power consumption
Figure: Power consumption trends for desktop
      processors [Austin et al.]
5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
  – One key advantage that embedded system architects can
    leverage is task-level parallelism
     • Many embedded applications neatly divide into several tasks or
       phases that communicate with each other
     • Which is a natural and easily exploitable source of parallelism
  – Desktop processors rely on instruction-level
    parallelism (ILP) to improve performance
     • But only a small amount of ILP is available in most
        programs
  – We can build custom multiprocessor architectures that
    reflect the task-level parallelism available in the application
     • And meet performance targets at much lower cost and
        with much less energy
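
A minimal sketch of what exploiting task-level parallelism looks like in SW: two pipeline stages communicating through a queue, each of which could be mapped onto its own PE. The stage names and the std::thread mapping are our illustration; the slide prescribes no particular API.

```cpp
// Two-stage pipeline: a producer stage feeds frames to a consumer stage
// through a mutex-protected queue, mirroring task-level parallelism.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void capture() {                  // stage 1: produce frames
    for (int f = 0; f < 5; ++f) {
        { std::lock_guard<std::mutex> lk(m); q.push(f); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
}

void compress() {                 // stage 2: consume frames as they arrive
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !q.empty() || done; });
        if (q.empty()) return;    // producer finished, queue drained
        int f = q.front(); q.pop();
        lk.unlock();
        std::printf("compressing frame %d\n", f);
    }
}

int main() {
    std::thread t1(capture), t2(compress);
    t1.join(); t2.join();
}
```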
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – It is the combination of high performance, low power, and
    real-time requirements that drives us to use multiprocessors (MPs)
  – And these requirements lead us further toward
    heterogeneous processors
     • Which starkly contrast with the symmetric multi-processors used
       for scientific computation
  – Multiprocessing Vs. Uniprocessing
     • Even if we build a multiprocessor out of several copies of the same
       type of CPU
         – We may end up with a more efficient system than if we used a
            uni-processor
     • The manufacturing cost of a microprocessor is a non-linear function
       of clock speed
         – Customers pay considerably more for modest increases in clock
            speed
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – Real Time & Multiprocessing
     • Real-time requirements also lead to multiprocessing
     • When we put several real-time processes on the same CPU, they
       compete for cycles
     • But we cannot be sure that we can use 100% of the CPU if we want
       to meet real-time deadlines
     • Furthermore, we must pay for those reserved cycles at the nonlinear
       rate of higher clock speed
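
One classical way to quantify those reserved cycles (our illustration; the slide names no particular scheduler): under fixed-priority rate-monotonic scheduling, the Liu-Layland bound guarantees that n periodic tasks always meet their deadlines only if total utilization stays below n(2^(1/n) - 1), which falls toward ln 2 ≈ 69.3% as n grows.

```cpp
// Liu-Layland rate-monotonic utilization bound: n * (2^(1/n) - 1).
#include <cmath>
#include <cstdio>

double rm_bound(int n) { return n * (std::pow(2.0, 1.0 / n) - 1.0); }

int main() {
    for (int n : {1, 2, 3, 5, 10})
        std::printf("n = %2d tasks: schedulable up to %.1f%% utilization\n",
                    n, 100.0 * rm_bound(n));
    // Roughly 30% of cycles must be held in reserve for large task sets,
    // and those reserved cycles are bought at the nonlinear price of a
    // faster clock.
}
```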
  – Multiprocessing & Accelerators
     • The next step beyond symmetric microprocessors is heterogeneous
       multiprocessors
     • We can specialize all aspects of the multiprocessor: the PEs, the
       memory, and the interconnection network
     • Specializations understandably lead to lower power consumption;
       perhaps less intuitively, they can also improve real-time behavior
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – Specialization
     • The following parts of embedded systems lend themselves to
       specialized implementations
         – Some operations, particularly those defined by standards, are not
           likely to change
              » The 8×8 DCT, for example, has become widely used well
                beyond its original function in JPEG
              » Given the frequency and variety of its uses, it is worthwhile to
                optimize not just the DCT, but in particular its 8×8 form
         – Some functions require operations that do not map well onto a
           CPU’s data operations
              » The mismatch may be due to several reasons
              » For instance, bit-level operations are difficult to perform
                efficiently on some CPUs
              » The operations may require too many registers
              » We can design either a specialized CPU or a special-purpose
                HW unit to perform these functions
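
As one concrete instance of the bit-level mismatch (an example of our choosing; the slide names no specific operation): reversing the bits of a word is pure wiring in HW but costs a CPU a sequence of mask-and-shift steps.

```cpp
// Bit reversal of a 32-bit word: ~15 ALU operations in software for
// something a HW unit implements with wiring alone.
#include <cstdint>
#include <cstdio>

uint32_t bit_reverse(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);  // swap bits
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);  // swap pairs
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);  // swap nibbles
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);  // swap bytes
    return (x << 16) | (x >> 16);                             // swap halves
}

int main() {
    std::printf("%08x -> %08x\n", 1u, bit_reverse(1u));
    // prints 00000001 -> 80000000
}
```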
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – Specialization
     • The following parts of embedded systems lend themselves to
       specialized implementations
         – Highly responsive I/O operations may be best performed by an
           accelerator with an attached I/O unit
          – If data must be read, processed, and written to meet a tight
            deadline, a dedicated HW unit (for example, in engine control)
            may be more efficient than a CPU
  – Cost Vs. Power
     • Heterogeneity reduces power consumption: it removes unnecessary
       HW
     • The additional HW required to generalize functions adds to both
       dynamic and static power dissipation
     • Excessive specialization can add so much communication cost that
       the energy gain from specialization is lost
     • However, specializing the right functions can lead to big energy
       savings
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – Real-Time Performance
     • In addition to reducing costs, using multiple CPUs
       can help with real-time performance
     • We can often meet deadlines and be responsive to
       interaction much more easily when we put those
       time-critical processes on separate CPUs
     • Specialized memory systems and interconnects
       also help make the response time of a process
       more predictable
5. Multiprocessor Architectures
• 5.2.4 Flexibility and Efficiency
   – Use HW and SW
      • Many embedded systems perform complex
        functions that would be too difficult to implement
        entirely in HW
      • Translating all the standards to HW may be too
        time-consuming and expensive
      • Multiple standards encourage SW implementation
          – For example, a player must be able to handle audio data in many
            different formats: MP3, Dolby Digital, Ogg Vorbis, etc.
         – These standards perform some similar operations but
           cannot be easily collapsed into a few key HW units
         – The reasonable choice: processors running SW, aided
           by a few key HW units
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – We discuss embedded multiprocessor design
    methodologies in detail
  – 5.3.1 Multiprocessor Design Methodologies
     • The design of embedded multiprocessors is data-driven
       and relies on analyzing programs
     • We call these programs the workload, in contrast with the
       term benchmark commonly used in computer architecture
      • Embedded systems operate under real-time
        constraints and overall throughput requirements
        – Therefore we often use a sample set of applications to
          evaluate overall system performance
        – These programs may not be the exact code run on the
          final system and the final system may have many modes
        – But using workloads is still useful and very important
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.1 Multiprocessor Design Methodologies
      • Benchmarks are generally treated as
        independent entities
      • While embedded multiprocessor design
        requires evaluating the interaction between
        programs
      • The workload, in fact, includes data
        inputs as well as the programs
        themselves
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.1 Multiprocessor Design Methodologies
       • Multiprocessor-based embedded system design methodology

        Workload
           → Platform-independent optimizations
           → Platform-independent measurements (operation counts, etc.)
           → Platform design (PE, memory, and interconnect design)
           → Platform-dependent optimizations
           → Implementation
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.1 Multiprocessor Design Methodologies
     • This workflow includes both the design of the HW
       platform and the SW that runs on the platform
     • Before the workload is used to evaluate the
       architecture, it generally must be put into good shape
       with platform-independent optimizations
     • Many programs are not written with embedded
       platform restrictions, real-time performance or low
       power in mind
     • Using programs designed to work in non-real-time
       mode with unlimited main memory can often lead to
       bad architectural decisions
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.1 Multiprocessor Design Methodologies
     • Once we have the workload programs in shape, we
       can perform simple experiments before defining an
       architecture
        – To obtain platform-independent measurements
     • Simple measurements, such as dynamic instruction
       count and data access patterns, provide valuable
       information about the nature of the workload
     • Using these platform-independent metrics, we can
       identify an initial candidate architecture
        – If the platform relies on static allocation, we may need to
          map the workload programs onto the platform
        – We then measure platform-dependent characteristics
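
A toy version of the platform-independent measurement step, with hand-inserted counters standing in for an instrumented simulator or profiler; the kernel and the counting granularity are our assumptions.

```cpp
// Count dynamic operations and data accesses of a workload kernel.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Counters { unsigned long ops = 0, loads = 0; };

// Example workload kernel (a dot product), hand-instrumented.
long dot(const std::vector<int>& a, const std::vector<int>& b, Counters& c) {
    long s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        s += long(a[i]) * b[i];  // one multiply + one add...
        c.ops += 2;              // ...counted as two operations
        c.loads += 2;            // two data reads per iteration
    }
    return s;
}

int main() {
    std::vector<int> a(1000, 1), b(1000, 2);
    Counters c;
    long r = dot(a, b, c);
    // Instruction mix and access patterns like these guide the choice of
    // PEs and memories before any platform is fixed.
    std::printf("result=%ld ops=%lu loads=%lu\n", r, c.ops, c.loads);
}
```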
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.1 Multiprocessor Design Methodologies
     • Based on these characteristics, we evaluate the
       architecture, using both numerical measures and
       judgment
     • If the platform is satisfactory, then we are finished
     • If not, we modify the platform and make a new round
       of measurements
     • Along the way, we need to design the components
       of the multiprocessor
        – The processing elements,
        – The memory system, and
        – The interconnects
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.1 Multiprocessor Design Methodologies
     • Once we are satisfied with the platform
        – We can map the SW onto the platform
        – During that process
            » We may be aided by libraries of code and
            » Compilers
        – Most of the optimizations performed at this phase should
          be platform-specific
             » We must allocate operations to processing elements
             » Allocate data to memories
             » Allocate communications to links
             » We now also have to determine when things happen (scheduling)
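
A toy illustration of the allocation step above: greedily assigning each task to the PE that currently finishes earliest. The task costs and PE count are invented, and real flows also model dependencies, data placement, and link traffic.

```cpp
// Greedy static allocation: each task goes to the least-loaded PE.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const std::vector<double> task_ms = {3.0, 1.0, 4.0, 1.5, 5.0, 2.0};
    std::vector<double> pe_finish(3, 0.0);   // 3 PEs, initially idle

    for (std::size_t t = 0; t < task_ms.size(); ++t) {
        auto it = std::min_element(pe_finish.begin(), pe_finish.end());
        *it += task_ms[t];
        std::printf("task %zu -> PE %ld (finish %.1f ms)\n",
                    t, (long)(it - pe_finish.begin()), *it);
    }
    std::printf("makespan: %.1f ms\n",
                *std::max_element(pe_finish.begin(), pe_finish.end()));
}
```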
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.2 Multiprocessor Modeling and Simulation
      • [Cai & Gajski] defined a hierarchy of modeling methods for digital
         systems and compared their characteristics
 Model                    Communication    Computation      Communication         PE
                          time             time             scheme                interface
 -----------------------  ---------------  ---------------  --------------------  ------------
 Specification            No               No               Variable              No PEs
 Component (PE) assembly  No               Approximate      Variable channel      Abstract
 Bus arbitration          Approximate      Approximate      Abstract bus channel  Abstract
 Bus functional           Cycle accurate   Approximate      Protocol bus channel  Abstract
 Cycle accurate           Approximate      Cycle accurate   Abstract bus channel  Pin accurate
 Implementation           Cycle accurate   Cycle accurate   Wires                 Pin accurate
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.2 Multiprocessor Modeling and Simulation
      • Most multiprocessor simulators are systems of
        communicating simulators
      • The component simulators represent CPUs,
        memory elements, and routing networks
      • The multiprocessor simulator itself negotiates
        communication between those component simulators
      • We can use the techniques of parallel computing to
        build the multiprocessor simulator
     • Each component simulator is a process, both in the
       simulation metaphor and literally as a process running
       on the host CPU’s operating system
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.2 Multiprocessor Modeling and Simulation
     • Consider the simulation of a write from a PE to a ME
       (memory element)
     • The PE and ME are each component simulators
       that run as processes on the host CPU
     • The WRITE operation requires a message from the
       PE simulator to the ME simulator

          PE Simulator --- Message(write address, data to be written) ---> ME Simulator
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.2 Multiprocessor Modeling and Simulation
      • The multiprocessor simulator must route that message
       by determining which simulation process is
       responsible for the address of the write operation
     • After performing the required mapping, it sends a
       message to the ME simulator, asking it to perform
       the write
      • Most multiprocessor simulators assume
        homogeneous MP architectures, and use that
        assumption to build simulation shortcuts
        – However, many Embedded MPs are heterogeneous, and
          therefore cannot use these optimizations
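
A minimal sketch of the routing step just described: the multiprocessor simulator looks up which ME simulator owns the write address and forwards the message. The address map and message layout are invented for illustration.

```cpp
// Route a write message to the memory-element simulator owning its address.
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <map>
#include <string>

struct WriteMsg { uint32_t addr, data; };

// Start address of each memory region -> owning ME simulator.
const std::map<uint32_t, std::string> address_map = {
    {0x00000000u, "ME0"}, {0x10000000u, "ME1"}, {0x20000000u, "ME2"},
};

// Find the ME whose region contains addr (last region starting at or below).
const std::string& route(uint32_t addr) {
    auto it = address_map.upper_bound(addr);
    return std::prev(it)->second;   // safe: the first region starts at 0
}

int main() {
    WriteMsg msg{0x10000040u, 0xDEADBEEFu};   // emitted by the PE simulator
    std::printf("write %08x @ %08x -> %s\n",
                msg.data, msg.addr, route(msg.addr).c_str());
}
```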
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
   – 5.3.2 Multiprocessor Modeling and Simulation
      • SystemC (http://www.systemc.org) is a widely used
        framework for transaction-level design of
        heterogeneous multiprocessors
     • It is designed to facilitate the simulation of
       heterogeneous architectures built from combinations
       of hardwired blocks and programmable processors
     • SystemC is built on top of C++
          – Defines a set of classes used to describe the
            system being simulated
          – A simulation manager guides the execution of the
            simulator
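
A minimal SystemC flavor of the above, a sketch assuming the standard SystemC 2.x class library: a module class describes a hardware block, and the simulation manager invoked through sc_start schedules its processes. The counter itself is just a placeholder block.

```cpp
// Minimal SystemC module plus sc_main; the SystemC kernel acts as the
// simulation manager that guides execution.
#include <systemc.h>

SC_MODULE(Counter) {
    sc_in<bool> clk;          // clock input
    sc_out<int> value;        // current count
    int count = 0;

    void tick() { value.write(++count); }

    SC_CTOR(Counter) {
        SC_METHOD(tick);              // process registered with the kernel
        sensitive << clk.pos();       // triggered on each rising clock edge
    }
};

int sc_main(int argc, char* argv[]) {
    sc_clock clk("clk", 10, SC_NS);   // 10 ns clock
    sc_signal<int> value;

    Counter c("counter");
    c.clk(clk);
    c.value(value);

    sc_start(100, SC_NS);             // simulation manager runs for 100 ns
    std::cout << "count after 100 ns: " << value.read() << std::endl;
    return 0;
}
```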

Más contenido relacionado

La actualidad más candente

Lecture 1
Lecture 1Lecture 1
Lecture 1Mr SMAK
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsSabidur Rahman
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSMaurvi04
 
Research Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel ProgrammingResearch Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel ProgrammingShitalkumar Sukhdeve
 
Reduce course notes class xi
Reduce course notes class xiReduce course notes class xi
Reduce course notes class xiSyed Zaid Irshad
 
Introduction To Parallel Computing
Introduction To Parallel ComputingIntroduction To Parallel Computing
Introduction To Parallel ComputingJörn Dinkla
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computingMehul Patel
 
Unit IV Memory and I/O Organization
Unit IV Memory and I/O OrganizationUnit IV Memory and I/O Organization
Unit IV Memory and I/O OrganizationBalaji Vignesh
 
Unit 1 Computer organization and Instructions
Unit 1 Computer organization and InstructionsUnit 1 Computer organization and Instructions
Unit 1 Computer organization and InstructionsBalaji Vignesh
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingAkhila Prabhakaran
 
Indian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingIndian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingAjil Jose
 
Parallel & Distributed processing
Parallel & Distributed processingParallel & Distributed processing
Parallel & Distributed processingSyed Zaid Irshad
 
Learn about computer hardware and software
Learn about computer hardware and softwareLearn about computer hardware and software
Learn about computer hardware and softwarefarrukh ishaq choudhary
 

La actualidad más candente (20)

Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithms
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
 
Research Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel ProgrammingResearch Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel Programming
 
Reduce course notes class xi
Reduce course notes class xiReduce course notes class xi
Reduce course notes class xi
 
Introduction To Parallel Computing
Introduction To Parallel ComputingIntroduction To Parallel Computing
Introduction To Parallel Computing
 
Parallel Computing
Parallel Computing Parallel Computing
Parallel Computing
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computing
 
Unit IV Memory and I/O Organization
Unit IV Memory and I/O OrganizationUnit IV Memory and I/O Organization
Unit IV Memory and I/O Organization
 
CS6303 - Computer Architecture
CS6303 - Computer ArchitectureCS6303 - Computer Architecture
CS6303 - Computer Architecture
 
Unit 1 Computer organization and Instructions
Unit 1 Computer organization and InstructionsUnit 1 Computer organization and Instructions
Unit 1 Computer organization and Instructions
 
Lecture02 types
Lecture02 typesLecture02 types
Lecture02 types
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
Indian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingIndian Contribution towards Parallel Processing
Indian Contribution towards Parallel Processing
 
Parallel & Distributed processing
Parallel & Distributed processingParallel & Distributed processing
Parallel & Distributed processing
 
Computer performance
Computer performanceComputer performance
Computer performance
 
parallel processing
parallel processingparallel processing
parallel processing
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 
Learn about computer hardware and software
Learn about computer hardware and softwareLearn about computer hardware and software
Learn about computer hardware and software
 

Similar a Hwswcd mp so_c_1

CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptxAbcvDef
 
Sample Solution Blueprint
Sample Solution BlueprintSample Solution Blueprint
Sample Solution BlueprintMike Alvarado
 
Distributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz SikanderDistributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz Sikanderrogerz1234567
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Ankit Gupta
 
P-DC-8-24102023-085824am (1).pptx
P-DC-8-24102023-085824am (1).pptxP-DC-8-24102023-085824am (1).pptx
P-DC-8-24102023-085824am (1).pptxYasirShaikh34
 
Training netbackup6x2
Training netbackup6x2Training netbackup6x2
Training netbackup6x2M Shariff
 
Operating systems 1
Operating systems 1Operating systems 1
Operating systems 1JoshuaIgo
 
Neuromophic device for Automotive
Neuromophic device for AutomotiveNeuromophic device for Automotive
Neuromophic device for AutomotiveYoshifumi Sakamoto
 
CAD theory presentation.pptx .
CAD theory presentation.pptx                .CAD theory presentation.pptx                .
CAD theory presentation.pptx .Athar739197
 
1. An Introduction to Embed Systems_DRKG.pptx
1. An Introduction to Embed Systems_DRKG.pptx1. An Introduction to Embed Systems_DRKG.pptx
1. An Introduction to Embed Systems_DRKG.pptxKesavanGopal1
 
Backup Exec Blueprints▶ Deduplication
Backup Exec Blueprints▶ DeduplicationBackup Exec Blueprints▶ Deduplication
Backup Exec Blueprints▶ DeduplicationSymantec
 
Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...
Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...
Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...Abhishekn84
 
Cloud Computing with InduSoft
Cloud Computing with InduSoftCloud Computing with InduSoft
Cloud Computing with InduSoftAVEVA
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...Edge AI and Vision Alliance
 
Mot so cau hoi ve may tinh
Mot so cau hoi ve may tinhMot so cau hoi ve may tinh
Mot so cau hoi ve may tinhQuoc Nguyen
 
Unit-1_Digital Computers, number systemCOA[1].pptx
Unit-1_Digital Computers, number systemCOA[1].pptxUnit-1_Digital Computers, number systemCOA[1].pptx
Unit-1_Digital Computers, number systemCOA[1].pptxVanshJain322212
 

Similar a Hwswcd mp so_c_1 (20)

CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
 
Sample Solution Blueprint
Sample Solution BlueprintSample Solution Blueprint
Sample Solution Blueprint
 
Distributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz SikanderDistributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz Sikander
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)
 
P-DC-8-24102023-085824am (1).pptx
P-DC-8-24102023-085824am (1).pptxP-DC-8-24102023-085824am (1).pptx
P-DC-8-24102023-085824am (1).pptx
 
Training netbackup6x2
Training netbackup6x2Training netbackup6x2
Training netbackup6x2
 
Lecture1
Lecture1Lecture1
Lecture1
 
Operating systems 1
Operating systems 1Operating systems 1
Operating systems 1
 
Ubiquisys at Femtocells Americas 11
Ubiquisys at Femtocells Americas 11Ubiquisys at Femtocells Americas 11
Ubiquisys at Femtocells Americas 11
 
Array Processor
Array ProcessorArray Processor
Array Processor
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
 
Neuromophic device for Automotive
Neuromophic device for AutomotiveNeuromophic device for Automotive
Neuromophic device for Automotive
 
CAD theory presentation.pptx .
CAD theory presentation.pptx                .CAD theory presentation.pptx                .
CAD theory presentation.pptx .
 
1. An Introduction to Embed Systems_DRKG.pptx
1. An Introduction to Embed Systems_DRKG.pptx1. An Introduction to Embed Systems_DRKG.pptx
1. An Introduction to Embed Systems_DRKG.pptx
 
Backup Exec Blueprints▶ Deduplication
Backup Exec Blueprints▶ DeduplicationBackup Exec Blueprints▶ Deduplication
Backup Exec Blueprints▶ Deduplication
 
Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...
Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...
Basic Structure of Computers: Functional Units, Basic Operational Concepts, B...
 
Cloud Computing with InduSoft
Cloud Computing with InduSoftCloud Computing with InduSoft
Cloud Computing with InduSoft
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
 
Mot so cau hoi ve may tinh
Mot so cau hoi ve may tinhMot so cau hoi ve may tinh
Mot so cau hoi ve may tinh
 
Unit-1_Digital Computers, number systemCOA[1].pptx
Unit-1_Digital Computers, number systemCOA[1].pptxUnit-1_Digital Computers, number systemCOA[1].pptx
Unit-1_Digital Computers, number systemCOA[1].pptx
 

Último

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Último (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Hwswcd mp so_c_1

  • 1. Hardware/Software Co-Design Lecture MPSoC 1
  • 2. 5. Multiprocessor Architectures • 5.1 Introduction – The focus is on embedded microprocessors study – Multiprocessing (MP) is very common in embedded computing because • Allows us to meet our performance, cost and energy/power consumption goals – Embedded MP are often heterogeneous multi-processors • Made of several types of processors • They run sophisticated SW that must be carefully designed to obtain the most out of the multi- processor
  • 3. 5. Multiprocessor Architectures • 5.1 Introduction – A multiprocessor is made of multiple processing elements (PEs) Processing Processing Processing Element Element Element Generic Interconnection network Multiprocessor (MP) Memory Memory Memory
  • 4. 5. Multiprocessor Architectures • 5.1 Introduction – An MP consists of 3 major subsystems 1. Processing elements that operate on data 2. Memory blocks that hold data values 3. Interconnection networks between the PEs and memory • In any MP design we have to decide – How many PEs to use – How much memory and how to divide it up – How rich the interconnection between the PEs and memory should be
  • 5. 5. Multiprocessor Architectures • 5.1 Introduction – When designing an embedded multiprocessor the choices are varied and complex • SERVERS typically use symmetric MP built of identical PEs and uniform memory – This simplifies programming the machine – BUT ES designers will be ready to trade off some programming complexity for cost/performance/energy/power • => some additional variables – We can vary the types of PEs, they do not have to be of the same type • Different types of CPUs • Non –programmable PEs (perform only 1 function)
  • 6. 5. Multiprocessor Architectures • 5.1 Introduction – We can use memory blocks of different sizes • Also we do not have to require that every PE access all memory – Using private memories that are shared by only a few PEs – Therefore the MEM performance is optimized for the units that use it – We can use specialized interconnection networks that provide only certain connections
  • 7. 5. Multiprocessor Architectures • 5.1 Introduction – Embedded MPs • Make use of SIMD parallelism techniques • But MIMD architectures are the dominant mode of parallel machines in Embedded Computing • They tend to be heterogeneous (varied) PEs • Scientific MPs tend to be homogeneous parallel machines (copies of the same type of PEs)
  • 8. 5. Multiprocessor Architectures • 5.2 Why Embedded Multiprocessors? • MPs are commonly used for scientific and business servers, so why need them in embedded computing? – Because many of them actually have to support huge amounts of computation – The best way to meet those demands is to use MPs • This is particularly true when we must meet real- time constraints that are concerned with power consumption
  • 9. 5. Multiprocessor Architectures • 5.2 Why Embedded Multiprocessors? • Embedded MPs face more constraints than scientific processors do – Both intend to deliver high performance but Embedded Systems must do something in addition • They must provide real-time performance that is predictable • They often run at low energy and power levels • They have to be cost effective (i.e. provide high performance without using excessive amounts of HW)
  • 10. 5. Multiprocessor Architectures • 5.2 Why Embedded Multiprocessors? • The rigorous demands of embedded computing push us toward several design techniques – Heterogeneous microprocessors are often more energy-efficient and cost-effective than symmetric multiprocessors – Heterogeneous memory systems improve real-time performance – NoCs support heterogeneous architectures
  • 11. 5. Multiprocessor Architectures • 5.2.1 Requirements on Embedded Systems • Example: Computation in Cellular Telephones – A cellular telephone must perform a variety of functions that are basic to telephony • Compute and check error-correction codes • Perform voice compression and decompression • Respond to the protocol that governs communication with the cellular network
  • 12. 5. Multiprocessor Architectures • 5.2.1 Requirements on Embedded Systems • Example: Computation in Cellular Telephones – Furthermore, modern cell phones must perform a variety of other functions that are required by regulations or demanded by the marketplace • In US, cell phones must keep track of their position in case the user must be located for emergency services – A GPS is often used to find the phone’s position • Many cell phones play MP3 audio and also use MIDI or other methods to play music for ring tones • High-end cell phones provide cameras for still pictures and video • Cell phones may download application code from network
  • 13. 5. Multiprocessor Architectures • 5.2.1 Requirements on Embedded Systems • Example: Computation in Video Cameras – Video compression requires a great deal of computation, even for small images – Most video compression systems combine 3 basic methods to compress video • Lossless compression is used to reduce the size of the representation of the video data stream • Discrete cosine transform (DCT) is used to help quantize the images and reduce the size of the video stream by lossy encoding • Motion estimation and compensation allow the contents of one frame to be described in terms of motion from another frame
  • 14. 5. Multiprocessor Architectures • 5.2.1 Requirements on Embedded Systems • Example: Computation in Video Cameras – Most video compression systems combine the 3 basic methods to compress video – Of these 3, motion estimation is the most computationally intensive • Even an efficient motion estimation algorithm must perform a 16×16 correlation at several points in the video frame, and if must be done for the entire frame • For a QCIF frame which is commonly used in cell phones, we have 176×144 pixels – That frame is divided into 11×9 of these 16×16 macroblocks for motion estimation • If we perform correlations for each macroblock – We will have to perform 11×9×16×16 = 25,344 pixel comparisons – All these calculations must be done on almost every frame, at a rate of 15 or 30 frames/second!!!
  • 15. 5. Multiprocessor Architectures • 5.2.1 Requirements on Embedded Systems • Example: Computation in Video Cameras – Most video compression systems combine 3 basic methods to compress video – Of these 3, motion estimation is the most computationally intensive – The DCT operator is also computationally intensive • Even efficient algorithms require a large number of multiplications to perform the 8×8 DCT that is commonly used in video and image compression – For example [Feig and Winograd] an algorithm for DCT uses 94 multiplications and 454 additions to perform an 8×8 2-D DCT – This amounts to 148,896 multiplications per frame for a size frame with 1,584 blocks
  • 16. 5. Multiprocessor Architectures • 5.2.2 Performance and Energy – Many embedded applications need lots of raw processing performance • But that is not enough, those computations have to be performed efficiently – [Austin et al. 2004] posed the embedded system performance problem as “mobile supercomputing” • Today’s PDA/Cell phones already perform a great deal of what once was considered as requiring large processors – Speech recognition – Video compression and recognition – High-resolution graphics – High-bandwidth wireless communication
  • 17. 5. Multiprocessor Architectures • 5.2.2 Performance and Energy – [Austin et al.] estimate that a mobile supercomputing workload would require about 10,000 SPECint of performance – That means about 16× of that provided by a 2GHz Intel Pentium IV processor – In the mobile environment, all this computation must be performed at very low energy • Battery power is growing at only 5%/year
  • 18. 5. Multiprocessor Architectures • 5.2.2 Performance and Energy – Given that today’s highest-performance batteries have an energy density close to that of TNT – We may be close to the amount of energy that people are willing to carry with them =
  • 19. 5. Multiprocessor Architectures • 5.2.2 Performance and Energy – [Mudge et al.] estimate that to power the mobile supercomputer with a battery for 5 days, with it being used 20% of the time • It must consume no more than 74 mW • Unfortunately, general-purpose processors do not meet those trends – Moore’s law: dictates that chip sizes double every 18 months => circuits run faster • If we could make use of all the potential increase in speed, we could meet the 10,000 SPECint performance target • But trends show that we are not keeping up with performance • The performance of commercial processors and predicted trends • Traditional optimizations (pipelining, instruction-level parallelism) are becoming less effective (they have previously helped designers capture Moore’s law)
  • 20. Performance trends for desktop processors [Austin et al.] IEEE Computer Society
  • 21. 5. Multiprocessor Architectures • 5.2.2 Performance and Energy – [Mudge et al.] show that power consumption is getting worse – We need to reduce the energy consumption of the processor to use it in a mobile supercomputer! • But desktop processors consume more power with every new generation – Breaking away from these trends requires taking advantage of the characteristics of the problem • Adding units that are tuned to the core operations that we need to perform and • Eliminating HW that does not directly contribute to performance for this equation – By designing HW that meets its performance goals efficiently, we reduce system’s power consumption
  • 22. Power consumption trends for desktop processors [Austin et al.]
  • 23. 5. Multiprocessor Architectures • 5.2.2 Performance and Energy – One key advantage that embedded system architects can leverage is task-level parallelism • Many embedded applications neatly divide into several tasks or phases that communicate with each other • Which is a natural and easily exploitable source of parallelism – Desktop processors rely on instruction-level parallelism (ILP) to improve performance • But only a small amount of ILP is available in most programs – We can build custom multiprocessor architectures that reflect the task-level parallelism available in the application • And meet performance targets at much lower cost and with much less energy
  • 24. 5. Multiprocessor Architectures • 5.2.3 Specialization and Multiprocessors – It is the combination of high performance, low power, and real-time that drives us to use multiprocessors (MPs) – And these requirements lead us further toward heterogeneous processors • Which starkly contrast with the symmetric multi-processors used for scientific computation – Multiprocessing Vs. Uniprocessing • Even if we build a multiprocessor out of several copies of the same type of CPU – We may end up with a more efficient system than if we used a uni-processor • The manufacturing cost of a microprocessor is a non-linear function of clock speed – Customers pay considerably more for modest increases in clock speed
  • 25. 5. Multiprocessor Architectures • 5.2.3 Specialization and Multiprocessors – Real Time & Multiprocessing • Real-time requirements also lead to multiprocessing • When we put several real-time processes on the same CPU, they compete for cycles • But we cannot be sure that we can use 100% of the CPU if we want to meet real-time deadlines • Furthermore, we must pay for those reserved cycles at the nonlinear rate of higher clock speed – Multiprocessing & Accelerators • The next step beyond symmetric microprocessors is heterogeneous multiprocessors • We can specialize all aspects of the multiprocessor: the PEs, the memory, and the interconnection network • Specializations understandably lead to lower power consumption; perhaps less intuitively, they can also improve real-time behavior
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – Specialization
     • The following parts of embedded systems lend themselves to specialized implementations
        – Some operations, particularly those defined by standards, are not likely to change
           » The 8×8 DCT, for example, has become widely used well beyond its original function in JPEG
           » Given the frequency and variety of its uses, it is worthwhile to optimize not just the DCT in general, but its 8×8 form in particular (a reference formulation follows below)
        – Some functions require operations that do not map well onto a CPU's data operations
           » The mismatch may have several causes
           » For instance, bit-level operations are difficult to perform efficiently on some CPUs
           » The operations may require too many registers
           » We can design either a specialized CPU or a special-purpose HW unit to perform these functions
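  – For concreteness, a naive reference implementation of the 8×8 two-dimensional DCT-II used in JPEG (a sketch for illustration only; a hardware DCT unit would use a fast fixed-point factorization rather than this O(N^4) floating-point loop nest)

```cpp
#include <cmath>

// Naive 8x8 forward DCT-II as defined in JPEG:
// F(u,v) = (1/4) C(u) C(v) * sum_{x,y} f(x,y)
//          * cos((2x+1)u*pi/16) * cos((2y+1)v*pi/16),
// where C(0) = 1/sqrt(2) and C(k) = 1 otherwise.
void dct8x8(const double in[8][8], double out[8][8]) {
    const double pi = 3.14159265358979323846;
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < 8; ++x)
                for (int y = 0; y < 8; ++y)
                    sum += in[x][y]
                         * std::cos((2 * x + 1) * u * pi / 16.0)
                         * std::cos((2 * y + 1) * v * pi / 16.0);
            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
}
```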
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – Specialization
     • The following parts of embedded systems lend themselves to specialized implementations
        – Highly responsive I/O operations may be best performed by an accelerator with an attached I/O unit
        – If data must be read, processed, and written to meet a tight deadline (as in engine control, for example), a dedicated HW unit may be more efficient than a CPU
  – Cost vs. Power
     • Heterogeneity reduces power consumption because it removes unnecessary HW
     • The additional HW required to generalize functions adds to both dynamic and static power dissipation
     • Excessive specialization can add so much communication cost that the energy gain from specialization is lost
     • However, specializing the right functions can lead to big energy savings
5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
  – Real-Time Performance
     • In addition to reducing costs, using multiple CPUs can help with real-time performance
     • We can often meet deadlines and remain responsive to interaction much more easily when we put time-critical processes on separate CPUs
     • Specialized memory systems and interconnects also help make the response time of a process more predictable
5. Multiprocessor Architectures
• 5.2.4 Flexibility and Efficiency
  – Use HW and SW
     • Many embedded systems perform complex functions that would be too difficult to implement entirely in HW
     • Translating all the relevant standards into HW may be too time-consuming and expensive
     • Multiple standards encourage SW implementation
        – For example, a player must handle audio data in many different formats: MP3, Dolby Digital, Ogg Vorbis, etc.
        – These standards perform some similar operations but cannot easily be collapsed into a few key HW units
        – The reasonable choice: processors running SW, aided by a few key HW units (see the sketch below)
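  – One way to picture "many SW formats sharing a few key HW units" (a hypothetical sketch; the AudioDecoder interface and the accel_imdct() hook are invented for illustration): each format is a software decoder, and all of them funnel their numerically heavy inner transform to one shared accelerator

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hook into a shared hardware accelerator; a real system
// would dispatch this to, e.g., a filter-bank/IMDCT unit via a driver.
// Software stand-in so the sketch compiles:
void accel_imdct(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i];
}

// Common software interface: one class per supported format.
struct AudioDecoder {
    virtual ~AudioDecoder() = default;
    virtual std::vector<float> decodeFrame(const unsigned char* bits,
                                           std::size_t len) = 0;
};

struct Mp3Decoder : AudioDecoder {
    std::vector<float> decodeFrame(const unsigned char* bits,
                                   std::size_t len) override {
        (void)bits; (void)len;              // bitstream parsing elided
        std::vector<float> spectra(576), pcm(576);
        accel_imdct(spectra.data(), pcm.data(), spectra.size());  // shared HW
        return pcm;
    }
};

struct VorbisDecoder : AudioDecoder {
    std::vector<float> decodeFrame(const unsigned char* bits,
                                   std::size_t len) override {
        (void)bits; (void)len;              // bitstream parsing elided
        std::vector<float> spectra(1024), pcm(1024);
        accel_imdct(spectra.data(), pcm.data(), spectra.size());  // same unit
        return pcm;
    }
};
```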
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – We discuss embedded multiprocessor design methodologies in detail
  – 5.3.1 Multiprocessor Design Methodologies
     • The design of embedded multiprocessors is data-driven and relies on analyzing programs
     • We call these programs the workload, in contrast with the term benchmark commonly used in computer architecture
     • Embedded systems must meet both real-time constraints and overall throughput requirements
        – Therefore we often use a sample set of applications to evaluate overall system performance
        – These programs may not be the exact code run on the final system, and the final system may have many modes
        – But using workloads is still useful and very important
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.1 Multiprocessor Design Methodologies
     • Benchmarks are generally treated as independent entities
     • Embedded multiprocessor design, in contrast, requires evaluating the interaction between programs
     • The workload, in fact, includes data inputs as well as the programs themselves
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.1 Multiprocessor Design Methodologies
     • Multiprocessor-based embedded system design methodology
       (Flow diagram: workload → platform-independent optimizations → platform-independent measurements (operation counts, etc.) → platform design (PE, memory, interconnect design) → platform-dependent optimizations and measurements → implementation)
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.1 Multiprocessor Design Methodologies
     • This workflow includes both the design of the HW platform and the SW that runs on the platform
     • Before the workload is used to evaluate the architecture, it generally must be put into good shape with platform-independent optimizations
     • Many programs are not written with embedded platform restrictions, real-time performance, or low power in mind
     • Using programs designed to work in non-real-time mode with unlimited main memory can often lead to bad architectural decisions
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.1 Multiprocessor Design Methodologies
     • Once we have the workload programs in shape, we can perform simple experiments before defining an architecture
        – To obtain platform-independent measurements
     • Simple measurements, such as dynamic instruction count and data access patterns, provide valuable information about the nature of the workload (a toy measurement tool is sketched below)
     • Using these platform-independent metrics, we can identify an initial candidate architecture
        – If the platform relies on static allocation, we may need to map the workload programs onto the platform
        – We then measure platform-dependent characteristics
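  – As a flavor of what such a measurement might look like (a hypothetical sketch; real methodologies use instrumented simulators or profilers, and the one-mnemonic-per-line trace format here is invented), the following counts dynamic operation frequencies from an instruction trace

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Count how often each operation appears in a dynamic instruction
// trace (hypothetical format: one mnemonic per line, e.g. "add").
int main(int argc, char* argv[]) {
    if (argc != 2) { std::cerr << "usage: opcount <trace>\n"; return 1; }
    std::ifstream trace(argv[1]);
    std::map<std::string, long> counts;
    std::string op;
    long total = 0;
    while (trace >> op) { ++counts[op]; ++total; }
    for (const auto& [name, n] : counts)        // per-operation histogram
        std::cout << name << ": " << n << " ("
                  << 100.0 * n / total << "%)\n";
    std::cout << "dynamic instruction count: " << total << "\n";
}
```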
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.1 Multiprocessor Design Methodologies
     • Based on these characteristics, we evaluate the architecture, using both numerical measures and judgment
     • If the platform is satisfactory, then we are finished
     • If not, we modify the platform and make a new round of measurements
     • Along the way, we need to design the components of the multiprocessor
        – The processing elements,
        – The memory system, and
        – The interconnects
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.1 Multiprocessor Design Methodologies
     • Once we are satisfied with the platform
        – We can map the SW onto the platform
        – During that process we may be aided by libraries of code and compilers
        – Most of the optimizations performed at this phase should be platform-specific
           » We must allocate operations to processing elements
           » Allocate data to memories
           » Allocate communications to links
           » We now also have to determine when things happen
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.2 Multiprocessor Modeling and Simulation
     • [Cai & Gajski] defined a hierarchy of modeling methods for digital systems and compared their characteristics:

       Model                       | Communication time | Computation time | Communication scheme | PE interface
       ----------------------------|--------------------|------------------|----------------------|-------------
       Specification               | No                 | No               | Variable channel     | No PEs
       Component assembly          | No                 | Approximate      | Variable channel     | Abstract (PE)
       Bus arbitration             | Approximate        | Approximate      | Abstract bus channel | Abstract
       Bus functional              | Cycle-accurate     | Approximate      | Protocol bus channel | Abstract
       Cycle-accurate computation  | Approximate        | Cycle-accurate   | Abstract bus channel | Pin-accurate
       Implementation              | Cycle-accurate     | Cycle-accurate   | Wires                | Pin-accurate
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.2 Multiprocessor Modeling and Simulation
     • Most multiprocessor simulators are systems of communicating simulators
     • The component simulators represent CPUs, memory elements, and routing networks
     • The multiprocessor simulator itself negotiates communication between those component simulators
     • We can use the techniques of parallel computing to build the multiprocessor simulator
     • Each component simulator is a process, both in the simulation metaphor and literally as a process running on the host CPU's operating system
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.2 Multiprocessor Modeling and Simulation
     • Consider the simulation of a write from a PE to an ME (memory element)
     • The PE and ME are each component simulators that run as processes on the host CPU
     • The WRITE operation requires a message from the PE simulator to the ME simulator
       (Diagram: PE simulator — Message(write address, data to be written) → ME simulator)
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.2 Multiprocessor Modeling and Simulation
     • The multiprocessor simulator must route that message by determining which simulation process is responsible for the address of the write operation (a minimal sketch follows below)
     • After performing the required mapping, it sends a message to the ME simulator, asking it to perform the write
     • Most multiprocessor simulators assume homogeneous MP architectures and use that assumption to build simulation shortcuts
        – However, many embedded MPs are heterogeneous and therefore cannot use these optimizations
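  – A minimal sketch of that routing step (hypothetical; names such as MemSimulator and route_write are invented for illustration): the multiprocessor simulator keeps an address map and forwards each write message to whichever ME simulator owns the target address

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Message a PE simulator sends for a write transaction.
struct WriteMsg { std::uint32_t addr; std::uint32_t data; };

// Component simulator for one memory element (ME).
struct MemSimulator {
    std::uint32_t base;
    std::vector<std::uint32_t> words;
    void handle(const WriteMsg& m) {            // perform the write
        words[(m.addr - m.addr % 4 - base) / 4] = m.data;
        std::cout << "ME@0x" << std::hex << base << " wrote 0x" << m.data
                  << " at 0x" << m.addr << "\n";
    }
};

// The multiprocessor simulator: routes each message to the ME simulator
// responsible for the target address range (assumes the address is mapped).
struct MpSimulator {
    std::map<std::uint32_t, MemSimulator*> byBase;  // base address -> ME
    void route_write(const WriteMsg& m) {
        auto it = byBase.upper_bound(m.addr);   // first base > addr
        --it;                                   // owner: largest base <= addr
        it->second->handle(m);
    }
};

int main() {
    MemSimulator me0{0x0000, std::vector<std::uint32_t>(256)};
    MemSimulator me1{0x1000, std::vector<std::uint32_t>(256)};
    MpSimulator mp;
    mp.byBase = {{0x0000, &me0}, {0x1000, &me1}};
    mp.route_write({0x1008, 42});               // routed to me1
}
```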
5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
  – 5.3.2 Multiprocessor Modeling and Simulation
     • SystemC (http://www.systemc.org) is a widely used framework for transaction-level design of heterogeneous multiprocessors
     • It is designed to facilitate the simulation of heterogeneous architectures built from combinations of hardwired blocks and programmable processors
     • SystemC is built on top of C++
        – It defines a set of classes used to describe the system being simulated
        – A simulation manager guides the execution of the simulator
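  – A minimal SystemC flavor (a sketch assuming a standard SystemC installation; the module and instance names are our own): two modules communicate through an sc_fifo channel, with the simulation manager driving execution via sc_start

```cpp
#include <systemc.h>

SC_MODULE(Producer) {
    sc_fifo_out<int> out;
    SC_CTOR(Producer) { SC_THREAD(run); }
    void run() {
        for (int i = 0; i < 4; ++i) {
            out.write(i);            // blocking write into the channel
            wait(10, SC_NS);         // model some computation delay
        }
    }
};

SC_MODULE(Consumer) {
    sc_fifo_in<int> in;
    SC_CTOR(Consumer) { SC_THREAD(run); }
    void run() {
        while (true) {
            int v = in.read();       // blocking read from the channel
            cout << sc_time_stamp() << ": got " << v << endl;
        }
    }
};

int sc_main(int, char*[]) {
    Producer p("producer");
    Consumer c("consumer");
    sc_fifo<int> fifo(2);            // bounded FIFO channel, depth 2
    p.out(fifo);
    c.in(fifo);
    sc_start(100, SC_NS);            // simulation manager runs the model
    return 0;
}
```

  – At this level of abstraction, the sc_fifo plays the role of an abstract channel in the [Cai & Gajski] hierarchy above; refining it toward a protocol or pin-accurate bus moves the model down that table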