This document discusses multiprocessor architectures for embedded systems. It begins by introducing multiprocessors and their key components: processing elements, memory, and interconnection networks. It then discusses why multiprocessors are commonly used for embedded systems due to their ability to meet performance, cost, power and real-time constraints. Specialization of components is important for efficiency. The document provides examples of computational requirements for tasks like video compression to illustrate the need for high performance within power budgets.
2. 5. Multiprocessor Architectures
• 5.1 Introduction
– The focus is on the study of embedded multiprocessors
– Multiprocessing (MP) is very common in embedded computing because
• It allows us to meet our performance, cost, and energy/power consumption goals
– Embedded MPs are often heterogeneous multiprocessors
• Made of several types of processors
• They run sophisticated SW that must be carefully designed to get the most out of the multiprocessor
3. 5. Multiprocessor Architectures
• 5.1 Introduction
– A multiprocessor is made of multiple processing
elements (PEs)
[Figure: a generic multiprocessor (MP): multiple processing elements connected through an interconnection network to multiple memory blocks]
4. 5. Multiprocessor Architectures
• 5.1 Introduction
– An MP consists of 3 major subsystems
1. Processing elements that operate on data
2. Memory blocks that hold data values
3. Interconnection networks between the PEs and
memory
• In any MP design we have to decide
– How many PEs to use
– How much memory and how to divide it up
– How rich the interconnection between the
PEs and memory should be
5. 5. Multiprocessor Architectures
• 5.1 Introduction
– When designing an embedded multiprocessor the
choices are varied and complex
• SERVERS typically use symmetric MP built of identical PEs
and uniform memory
– This simplifies programming the machine
– BUT embedded system designers are willing to trade off some programming complexity for cost/performance/energy/power gains
• => this opens up some additional design variables
– We can vary the types of PEs; they do not have to be of the same type
• Different types of CPUs
• Non-programmable PEs (perform only one function)
6. 5. Multiprocessor Architectures
• 5.1 Introduction
– We can use memory blocks of different sizes
• Also, we do not have to require that every PE be able to access all memory
– We can use private memories that are shared by only a few PEs
– Therefore the memory performance is optimized for the units that use it
– We can use specialized interconnection
networks that provide only certain
connections
7. 5. Multiprocessor Architectures
• 5.1 Introduction
– Embedded MPs
• Make use of SIMD parallelism techniques
• But MIMD architectures are the dominant
mode of parallel machines in Embedded
Computing
• They tend to use heterogeneous (varied) PEs
• Scientific MPs, in contrast, tend to be homogeneous parallel machines (copies of the same type of PE)
8. 5. Multiprocessor Architectures
• 5.2 Why Embedded Multiprocessors?
• MPs are commonly used for scientific and business servers, so why do we need them in embedded computing?
– Because many embedded systems actually have to support huge amounts of computation
– The best way to meet those demands is to use MPs
• This is particularly true when we must meet real-time constraints while also limiting power consumption
9. 5. Multiprocessor Architectures
• 5.2 Why Embedded Multiprocessors?
• Embedded MPs face more constraints than
scientific processors do
– Both intend to deliver high performance but
Embedded Systems must do something in addition
• They must provide real-time performance that is
predictable
• They often run at low energy and power levels
• They have to be cost effective (i.e. provide high
performance without using excessive amounts of
HW)
10. 5. Multiprocessor Architectures
• 5.2 Why Embedded Multiprocessors?
• The rigorous demands of embedded
computing push us toward several design
techniques
– Heterogeneous microprocessors are often
more energy-efficient and cost-effective
than symmetric multiprocessors
– Heterogeneous memory systems improve
real-time performance
– Networks-on-chip (NoCs) support heterogeneous architectures
11. 5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded
Systems
• Example: Computation in Cellular
Telephones
– A cellular telephone must perform a
variety of functions that are basic to
telephony
• Compute and check error-correction codes
• Perform voice compression and
decompression
• Respond to the protocol that governs
communication with the cellular network
12. 5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Cellular Telephones
– Furthermore, modern cell phones must perform a
variety of other functions that are required by
regulations or demanded by the marketplace
• In the US, cell phones must keep track of their position in case the user must be located for emergency services
– A GPS receiver is often used to find the phone’s position
• Many cell phones play MP3 audio and also use MIDI or
other methods to play music for ring tones
• High-end cell phones provide cameras for still pictures
and video
• Cell phones may download application code from the network
13. 5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Video Cameras
– Video compression requires a great deal of computation,
even for small images
– Most video compression systems combine 3 basic methods
to compress video
• Lossless compression is used to reduce the size of the representation of the video
data stream
• Discrete cosine transform (DCT) is used to help quantize the images and reduce the
size of the video stream by lossy encoding
• Motion estimation and compensation allow the contents of one frame to be
described in terms of motion from another frame
14. 5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Video Cameras
– Most video compression systems combine the 3 basic
methods to compress video
– Of these 3, motion estimation is the most computationally
intensive
• Even an efficient motion estimation algorithm must perform a 16×16 correlation at several points in the video frame, and it must be done for the entire frame
• For a QCIF frame, which is commonly used in cell phones, we have 176×144 pixels
– That frame is divided into an 11×9 array of 16×16 macroblocks for motion estimation
• If we perform correlations for each macroblock
– We will have to perform 11×9×16×16 = 25,344 pixel comparisons
– All these calculations must be done on almost every frame, at
a rate of 15 or 30 frames/second!!!
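The macroblock arithmetic above can be checked in a few lines (Python is used here purely for the arithmetic; the 30 frames/s rate is the figure quoted in the slide):

```python
# A quick check of the motion-estimation arithmetic (QCIF, 16x16 macroblocks).
# One comparison per pixel of each macroblock, one candidate position per block.
frame_w, frame_h = 176, 144                        # QCIF resolution
mb = 16                                            # macroblock size
blocks_x, blocks_y = frame_w // mb, frame_h // mb  # 11 x 9 macroblocks
comparisons_per_frame = blocks_x * blocks_y * mb * mb
print(comparisons_per_frame)                       # 25344 pixel comparisons per frame
print(comparisons_per_frame * 30)                  # 760320 comparisons per second at 30 frames/s
```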
15. 5. Multiprocessor Architectures
• 5.2.1 Requirements on Embedded Systems
• Example: Computation in Video Cameras
– Most video compression systems combine 3 basic
methods to compress video
– Of these 3, motion estimation is the most
computationally intensive
– The DCT operator is also computationally intensive
• Even efficient algorithms require a large number of
multiplications to perform the 8×8 DCT that is commonly used
in video and image compression
– For example, an algorithm by [Feig and Winograd] uses 94 multiplications and 454 additions to perform an 8×8 2-D DCT
– This amounts to 148,896 multiplications per frame for a CIF-size (352×288) frame, which contains 1,584 blocks
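The DCT figures can be verified the same way. The 352×288 frame size is our assumption, chosen because it yields exactly 1,584 8×8 blocks; the per-block operation counts are from the slide:

```python
# Checking the DCT workload figures (Feig-Winograd 8x8 2-D DCT).
mults_per_block, adds_per_block = 94, 454
blocks_per_frame = (352 // 8) * (288 // 8)   # 44 * 36 = 1584 blocks (assumed 352x288 frame)
print(blocks_per_frame)                      # 1584
print(mults_per_block * blocks_per_frame)    # 148896 multiplications per frame
print(adds_per_block * blocks_per_frame)     # 719136 additions per frame
```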
16. 5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
– Many embedded applications need lots of raw processing
performance
• But that is not enough, those computations have to be
performed efficiently
– [Austin et al. 2004] posed the embedded system
performance problem as “mobile supercomputing”
• Today’s PDAs and cell phones already perform a great deal of what was once considered to require large processors
– Speech recognition
– Video compression and recognition
– High-resolution graphics
– High-bandwidth wireless communication
17. 5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
– [Austin et al.] estimate that a mobile supercomputing workload would require about 10,000 SPECint of performance
– That is about 16× the performance provided by a 2 GHz Intel Pentium 4 processor
– In the mobile environment, all this computation must be performed at very low energy
• Battery capacity is improving at only about 5% per year
18. 5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
– Given that today’s highest-performance batteries
have an energy density close to that of TNT
– We may be close to the amount of energy that people are willing
to carry with them
19. 5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
– [Mudge et al.] estimate that to power the mobile
supercomputer with a battery for 5 days, with it being used
20% of the time
• It must consume no more than 74 mW
• Unfortunately, general-purpose processors do not meet those requirements
– Moore’s law dictates that the number of transistors per chip doubles every 18 months => circuits can run faster
• If we could make use of all the potential increase in speed, we could meet the 10,000 SPECint performance target
• But trends show that processor performance is not keeping up with that curve
• [Figure: the performance of commercial processors vs. predicted trends]
• Traditional optimizations (pipelining, instruction-level parallelism), which previously helped designers capture Moore’s law, are becoming less effective
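As a rough sanity check of the 74 mW budget above (the conversion to joules and watt-hours is our derivation, not a figure from the slides, and it covers only the processor, not the whole handset):

```python
# Rough energy arithmetic for the 74 mW mobile-supercomputer power budget.
power_w = 0.074                     # 74 mW processor budget from [Mudge et al.]
active_s = 5 * 24 * 3600 * 0.20     # 5 days of battery life, used 20% of the time
energy_j = power_w * active_s
print(round(energy_j))              # ~6394 J of processor energy over 5 days
print(round(energy_j / 3600, 1))    # ~1.8 Wh, a small slice of a typical phone battery
```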
21. 5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
– [Mudge et al.] show that power consumption is
getting worse
– We need to reduce the energy consumption of the
processor to use it in a mobile supercomputer!
• But desktop processors consume more power with
every new generation
– Breaking away from these trends requires taking advantage of the characteristics of the problem
• Adding units that are tuned to the core operations that we need to perform, and
• Eliminating HW that does not directly contribute to performance for this application
– By designing HW that meets its performance goals efficiently, we reduce the system’s power consumption
23. 5. Multiprocessor Architectures
• 5.2.2 Performance and Energy
– One key advantage that embedded system architects can
leverage is task-level parallelism
• Many embedded applications neatly divide into several tasks or
phases that communicate with each other
• Which is a natural and easily exploitable source of parallelism
– Desktop processors rely on instruction-level
parallelism (ILP) to improve performance
• But only a small amount of ILP is available in most
programs
– We can build custom multiprocessor architectures that
reflect the task-level parallelism available in the application
• And meet performance targets at much lower cost and
with much less energy
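The task-level parallelism described above can be sketched as a pipeline of stages communicating through queues, each stage standing in for a processing element. The stage functions below are hypothetical placeholders, not anything from the document:

```python
# A minimal sketch of task-level parallelism: an application split into
# pipeline stages that communicate through queues, one worker per stage
# (each worker standing in for a processing element).
import threading
import queue

def run_stage(fn, inq, outq):
    """Consume items from inq, apply fn, pass results to outq."""
    while True:
        item = inq.get()
        if item is None:            # sentinel: shut the stage down
            outq.put(None)
            break
        outq.put(fn(item))

# Three hypothetical tasks of a media pipeline
capture   = lambda x: x * 2         # e.g., acquire and scale samples
transform = lambda x: x + 1         # e.g., a DCT-like transform
encode    = lambda x: f"pkt:{x}"    # e.g., entropy coding / packetizing

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
stages = [(capture, q0, q1), (transform, q1, q2), (encode, q2, q3)]
threads = [threading.Thread(target=run_stage, args=s) for s in stages]
for t in threads:
    t.start()

for x in range(3):
    q0.put(x)
q0.put(None)                        # end-of-stream sentinel

results = []
while (item := q3.get()) is not None:
    results.append(item)
for t in threads:
    t.join()
print(results)                      # ['pkt:1', 'pkt:3', 'pkt:5']
```

Because each stage has its own worker and its own queues, the stages can execute concurrently, which is exactly the structure a custom multiprocessor can mirror in hardware.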
24. 5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
– It is the combination of high performance, low power, and
real-time that drives us to use multiprocessors (MPs)
– And these requirements lead us further toward
heterogeneous processors
• Which starkly contrast with the symmetric multi-processors used
for scientific computation
– Multiprocessing Vs. Uniprocessing
• Even if we build a multiprocessor out of several copies of the same
type of CPU
– We may end up with a more efficient system than if we used a
uni-processor
• The manufacturing cost of a microprocessor is a non-linear function
of clock speed
– Customers pay considerably more for modest increases in clock
speed
25. 5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
– Real Time & Multiprocessing
• Real-time requirements also lead to multiprocessing
• When we put several real-time processes on the same CPU, they
compete for cycles
• But we cannot be sure that we can use 100% of the CPU if we want
to meet real-time deadlines
• Furthermore, we must pay for those reserved cycles at the nonlinear
rate of higher clock speed
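The point that we cannot count on using 100% of a CPU under real-time deadlines can be made concrete with the classic Liu and Layland utilization bound for rate-monotonic scheduling (a standard real-time scheduling result, not something stated in this document):

```python
# Liu-Layland bound: n periodic tasks under rate-monotonic scheduling are
# guaranteed schedulable only if total CPU utilization stays below
# n * (2**(1/n) - 1), which tends to ln 2 (~0.693) as n grows.
def rms_bound(n):
    return n * (2 ** (1 / n) - 1)

for n in (1, 2, 5):
    print(n, round(rms_bound(n), 3))   # 1.0, then 0.828, then 0.743
```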
– Multiprocessing & Accelerators
• The next step beyond symmetric multiprocessors is heterogeneous multiprocessors
• We can specialize all aspects of the multiprocessor: the PEs, the
memory, and the interconnection network
• Specializations understandably lead to lower power consumption;
perhaps less intuitively, they can also improve real-time behavior
26. 5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
– Specialization
• The following parts of embedded systems lend themselves to
specialized implementations
– Some operations, particularly those defined by standards, are not
likely to change
» The 8×8 DCT, for example, has become widely used well
beyond its original function in JPEG
» Given the frequency and variety of its uses, it is worthwhile to
optimize not just the DCT, but in particular its 8×8 form
– Some functions require operations that do not map well onto a
CPU’s data operations
» The mismatch may be due to several reasons
» For instance, bit-level operations are difficult to perform
efficiently on some CPUs
» The operations may require too many registers
» We can design either a specialized CPU or a special-purpose
HW unit to perform these functions
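As a small illustration of the bit-level mismatch mentioned above (the example operation is ours): reversing the bits of a byte takes a loop or a lookup table in software, while a dedicated hardware unit can do it with pure wiring in a single cycle.

```python
# Bit reversal of a byte in software: eight shift/mask steps per byte,
# whereas specialized hardware would just cross the wires.
def reverse_bits(byte):
    out = 0
    for _ in range(8):
        out = (out << 1) | (byte & 1)   # peel off the LSB, push it into the result
        byte >>= 1
    return out

print(hex(reverse_bits(0b10110000)))    # 0xd (i.e., 0b00001101)
```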
27. 5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
– Specialization
• The following parts of embedded systems lend themselves to
specialized implementations
– Highly responsive I/O operations may be best performed by an accelerator with an attached I/O unit
– This applies when data must be read, processed, and written to meet a tight deadline
– For example, in engine control, a dedicated HW unit may be more efficient than a CPU
– Cost Vs. Power
• Heterogeneity reduces power consumption: it removes unnecessary
HW
• The additional HW required to generalize functions adds to both
dynamic and static power dissipation
• Excessive specialization can add so much communication cost that
the energy gain from specialization is lost
• However, specializing the right functions can lead to big energy
savings
28. 5. Multiprocessor Architectures
• 5.2.3 Specialization and Multiprocessors
– Real-Time Performance
• In addition to reducing costs, using multiple CPUs
can help with real-time performance
• We can often meet deadlines and be responsive to
interaction much more easily when we put those
time-critical processes on separate CPUs
• Specialized memory systems and interconnects
also help make the response time of a process
more predictable
29. 5. Multiprocessor Architectures
• 5.2.4 Flexibility and Efficiency
– Use HW and SW
• Many embedded systems perform complex
functions that would be too difficult to implement
entirely in HW
• Translating all the standards to HW may be too
time-consuming and expensive
• Multiple standards encourage SW implementation
– For example, a system must be able to play audio data in many different formats: MP3, Dolby Digital, Ogg Vorbis, etc.
– These standards perform some similar operations but
cannot be easily collapsed into a few key HW units
– The reasonable choice: processors running SW, aided
by a few key HW units
30. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– We discuss embedded multiprocessor design
methodologies in detail
– 5.3.1 Multiprocessor Design Methodologies
• The design of embedded multiprocessors is data-driven
and relies on analyzing programs
• We call these programs the workload, in contrast with the
term benchmark commonly used in computer architecture
• Embedded systems operate under real-time constraints and overall throughput requirements
– Therefore we often use a sample set of applications to
evaluate overall system performance
– These programs may not be the exact code run on the
final system and the final system may have many modes
– But using workloads is still useful and very important
31. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.1 Multiprocessor Design Methodologies
• Benchmarks are generally treated as
independent entities
• While embedded multiprocessor design
requires evaluating the interaction between
programs
• The workload, in fact, includes data
inputs as well as the programs
themselves
33. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.1 Multiprocessor Design Methodologies
• The design workflow includes both the design of the HW platform and the SW that runs on the platform
• Before the workload is used to evaluate the
architecture, it generally must be put into good shape
with platform-independent optimizations
• Many programs are not written with embedded
platform restrictions, real-time performance or low
power in mind
• Using programs designed to work in non-real-time
mode with unlimited main memory can often lead to
bad architectural decisions
34. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.1 Multiprocessor Design Methodologies
• Once we have the workload programs in shape, we
can perform simple experiments before defining an
architecture
– To obtain platform-independent measurements
• Simple measurements, such as dynamic instruction
count and data access patterns, provide valuable
information about the nature of the workload
• Using these platform-independent metrics, we can
identify an initial candidate architecture
– If the platform relies on static allocation, we may need to
map the workload programs onto the platform
– We then measure platform-dependent characteristics
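A toy sketch of a platform-independent measurement: here we stand in for dynamic instruction count with the number of executed Python lines (a crude proxy of our own invention, purely to illustrate that such counts depend on the workload's input data):

```python
# Counting the dynamic "operations" (executed Python lines) of a workload
# function, as a stand-in for the dynamic instruction counts a real
# methodology would gather before committing to an architecture.
import sys

def dynamic_line_count(fn, *args):
    count = 0
    def tracer(frame, event, arg):
        nonlocal count
        if event == "line":
            count += 1
        return tracer
    sys.settrace(tracer)          # trace frames created from here on
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return count

def workload(n):                  # hypothetical workload program
    total = 0
    for i in range(n):
        total += i * i
    return total

small = dynamic_line_count(workload, 10)
large = dynamic_line_count(workload, 100)
print(small, large)               # the dynamic count grows with the input size
```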
35. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.1 Multiprocessor Design Methodologies
• Based on these characteristics, we evaluate the
architecture, using both numerical measures and
judgment
• If the platform is satisfactory, then we are finished
• If not, we modify the platform and make a new round
of measurements
• Along the way, we need to design the components
of the multiprocessor
– The processing elements,
– The memory system, and
– The interconnects
36. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.1 Multiprocessor Design Methodologies
• Once we are satisfied with the platform
– We can map the SW onto the platform
– During that process
» We may be aided by libraries of code and
» Compilers
– Most of the optimizations performed at this phase should
be platform-specific
» We must allocate operations to processing elements
» Allocate data to memories
» Allocate Communications to links
» We now also have to determine when things happen
37. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.2 Multiprocessor Modeling and Simulation
• [Cai & Gajski] defined a hierarchy of modeling methods for digital
systems and compared their characteristics
Model                   | Communication time | Computation time | Communication scheme | PE interface
Specification           | No                 | No               | Variable             | No PEs
Component (PE) assembly | No                 | Approximate      | Variable channel     | Abstract
Bus arbitration         | Approximate        | Approximate      | Abstract bus channel | Abstract
Bus functional          | Cycle accurate     | Approximate      | Protocol bus channel | Abstract
Cycle accurate          | Approximate        | Cycle accurate   | Abstract bus channel | Pin accurate
Implementation          | Cycle accurate     | Cycle accurate   | Wires                | Pin accurate
38. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.2 Multiprocessor Modeling and Simulation
• Most multiProc simulators are systems of
communicating simulators
• The component simulators represent CPUs,
memory elements, and routing networks
• The multiProc simulator itself negotiates
communication between those component simulators
• We can use the techniques of parallel computing to
build the multiProc simulator
• Each component simulator is a process, both in the
simulation metaphor and literally as a process running
on the host CPU’s operating system
39. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.2 Multiprocessor Modeling and Simulation
• Consider the simulation of a write from a PE to a ME
(memory element)
• The PE and ME are each component simulators
that run as processes on the host CPU
• The WRITE operation requires a message from the
PE simulator to the ME simulator
[Figure: the PE simulator sends a message (write address, data to be written) to the ME simulator]
40. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.2 Multiprocessor Modeling and Simulation
• The MultiProc simulator must route that message
by determining which simulation process is
responsible for the address of the write operation
• After performing the required mapping, it sends a
message to the ME simulator, asking it to perform
the write
• Most multiprocessor simulators assume homogeneous MP architectures and use that assumption to build in simulation shortcuts
– However, many embedded MPs are heterogeneous and therefore cannot use these optimizations
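The write transaction described in the last two slides might be sketched as follows; the class names and the address map are made up for illustration, and real simulators would exchange these messages between separate processes:

```python
# Sketch of the simulated write: a PE component simulator issues a WRITE,
# and the multiprocessor simulator routes the message to whichever memory
# element (ME) simulator owns the target address.

class MemorySimulator:
    """Component simulator for one memory element (ME)."""
    def __init__(self, base, size):
        self.base, self.size = base, size
        self.contents = {}
    def owns(self, addr):
        return self.base <= addr < self.base + self.size
    def write(self, addr, data):
        self.contents[addr - self.base] = data

class MultiprocessorSimulator:
    """Negotiates communication between component simulators."""
    def __init__(self, memories):
        self.memories = memories
    def route_write(self, addr, data):
        for me in self.memories:          # map the address to the responsible ME
            if me.owns(addr):
                me.write(addr, data)
                return
        raise ValueError(f"unmapped address {addr:#x}")

class PESimulator:
    """Component simulator for a processing element (PE)."""
    def __init__(self, system):
        self.system = system
    def write(self, addr, data):
        # The WRITE becomes a message handled by the multiprocessor simulator
        self.system.route_write(addr, data)

mem0 = MemorySimulator(base=0x0000, size=0x1000)
mem1 = MemorySimulator(base=0x1000, size=0x1000)
system = MultiprocessorSimulator([mem0, mem1])
pe = PESimulator(system)

pe.write(0x1234, 42)          # routed to mem1; offset 0x234 now holds 42
print(mem1.contents)
```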
41. 5. Multiprocessor Architectures
• 5.3 Multiprocessor Design Techniques
– 5.3.2 Multiprocessor Modeling and Simulation
• SystemC (http://www.systemc.org) is a widely used
framework for transaction-level design of
heterogeneous MultiProcs
• It is designed to facilitate the simulation of
heterogeneous architectures built from combinations
of hardwired blocks and programmable processors
• SystemC is built on top of C++
– Defines a set of classes used to describe the
system being simulated
– A simulation manager guides the execution of the
simulator