Multicore 101: Migrating Embedded Apps to Multicore with Linux
1. Multicore 101: Migrating Embedded Applications to a Multicore Environment with Linux
Presented by MontaVista Software and Freescale Semiconductor
Ian Forsyth
Senior Enablement Architect
Freescale Semiconductor
Brad Dixon
Director of Product Management
MontaVista Software
Attend Vision for more in-depth multicore sessions
www.mvista.com/Vision
2. Agenda
►The Challenge in Migrating Applications
• The “Net Effect”
• Changing networking topology
• The multicore challenge
►Proposed Multicore Solutions
• Combined hardware/software
• Virtualization and hypervisor
►The Pathway to Migrating Your Applications
• Contain – Exploit – Analyze – Optimize
• Use the right tools
►Learn more and evaluate multicore solutions
• Evaluate MontaVista TestDrive: Freescale + MontaVista Linux
Multicore 101
3. The “Net Effect”
Networking trends drive the need for more performance
[Figure: converged networking landscape. Labels: Metro Carrier Edge Router; IMS Controller; SSL, IPSec, Firewall; Serving Node Router (GSN); Converged Networking; Storage Networks; IP Services; TelePresence; Enterprise; Wireless; Access Gateway; Access Point; Aggregation; Integrated Services Routers; Unified Threat Management; Network Admission Control; Service Provider Routers]
4. The Changing Networking Topology
► Layer 4-7 (Application) processing in the network is now common
► Increasing integration in datacom deployments
► Both driving higher computational capabilities from hardware vendors
5. Why Multicore in Embedded Networks?
► Demand for differentiating features
► Advanced services are implemented in software running on general purpose CPUs
► Frequency scaling of CPU cores is no longer valid, primarily due to power
► Multicore processors viewed as most viable approach
[Figure: power versus performance requirement for 1xCPU and nxCPU devices, with the device hot-spot power limit marked]
6. The Multicore Challenge – It’s All About the Software
► Multicore silicon devices have raced ahead of the embedded software market’s ability to support them
► Millions of lines of single-threaded legacy code will need to be rewritten in a parallel fashion in order to utilize multicore devices
► Creates a paradigm shift in how developers must think about and implement future programs
► No automated or “quick-fix” approaches for this software migration and paradigm shift – significant programmer effort is required
► Tools and support – simulators, compilers, OS, virtualization packages, performance profilers, debuggers, example applications and training will all be key to the widespread adoption of multicore solutions
[Figure: single-threaded legacy software contrasted with multicore software running across several Power Architecture™ cores, each with its own I-Cache, D-Cache and L2 Cache]
7. Multicore Tools and Solutions
Software Pyramid
Market-specific multicore stacks, apps, libraries. Support green field.
Support for standard and OS-dependent programming models, often leveraging multiprocessor.
Base multicore infrastructure: Operating System, boot standards.
First-rate tools: debuggers, performance and trace analyzers, simulators, compilers.
[Figure: pyramid. Labels: Stacks; N/W Accel; Libraries; Early Code; Partitioning; Hardware & Software; Hypervisor; SMP/AMP OS’s; Advance Debug]
8. QorIQ™ Solution Platforms
Simulation to Hardware: Same Software
[Figure: the same software stack (Applications; IDE (compiler / debugger / build tools); Optimized High-Speed Drivers; Hypervisor) shown on the Simics Virtualized Development Environment (Functional Model, Performance Model, API) and on Freescale QorIQ™ Silicon. Legend: Freescale-supplied]
9. Hybrid Functional/Performance Simulator
[Figure: Functional Model and Performance Model of the same system. Labels: CPU; Ethernet; I/O; ROM; RAM; Bus; Hardware Acceleration; API]
[Figure: timeline over Simulated Time. Labels: Functional Mode Simulation - High Speed; Periodic Checkpoints; Performance Mode Simulation; Functional Mode Simulation]
10. Virtualization for Reduced Cycle Time
A Hybrid Model:
• Functional: provides programmer's view of the SoC
• Performance
• Deterministic
• Non-invasive
Single Simulation Environment
• Core: e200, e300, e500, e600, …
• SOC: MPC8360/MPC8641D, MPC8548/MPC8572, Multicore Platform/ …
• Boards
• Systems
• Products and Systems
Systematic control of validation and error
• Control of time
• Control of cores
• Control of configuration
• Force and detect race conditions
• Optimized solutions
Freescale with Virtutech and MontaVista provide a multicore development platform that accelerates software development before and after silicon availability
11. MPC8641/40D Dual Core Block Diagram
► Dual e600 PowerPC cores @ 1.25/1.0 GHz
• 1MB L2 Cache w/ECC per core
• 36-bit physical addressing
► System Unit
• 64b DDR/DDR2 w/ECC
• 4x 10/100/1000 Ethernet Controllers
► High-speed Interfaces
• 1x/4x SRIO (2.5GB/s) and x1/x2/x4/x8 PCI-Express (4GB/s)
• OR two x1/x2/x4/x8 PCI-Express (8GB/s)
► Pin and Software compatible to MC8641D
► Max Power (Watts)
• 31.0 W @ 1.25 GHz
• 21.0 W @ 1.00 GHz
► Production Availability
• 0 to 105C – Now
• -40 to 105C – Q408
► MontaVista commercial support
• Professional Edition 5.0
• Carrier Grade Edition 5.0
12. QorIQ™ P4080 Multicore
It’s a smarter approach to multicore.
Freescale’s Multicore Platform
► Innovative Multicore Micro-architecture for unprecedented computing efficiency, performance and scalability.
• On-chip coherency fabric
• Back-side cache per CPU core
• On-demand application acceleration
► Multicore Simulation Environment for accurate, fast code development and debugging.
• Fully tap the capabilities of the multicore platform
• Debug software not hardware
• Dynamic, real-time debug with non-intrusive capture
► 45-nm Process Technology for industry-leading power-to-performance solution.
• Provides highest instructions-per-cycle (IPC) and frequency for given Milliwatt/area
Features
• Eight e500mc cores
• CoreNet™ scales to 32 cores
• PCI Express® 2.0, 10GbE
• PME 2.0, SEC 4.0
• Data path acceleration
• Trust/secure boot
• Hypervisor
• Standardized debug
• Virtualization with real applications
• High-performance SoC
• Advanced technology
• Tier one partnerships
• Outstanding ecosystem
• MontaVista Linux support
13. Datapath Acceleration Architecture
Datapath Acceleration Architecture simultaneously enables a lower complexity software environment as well as very high networking performance
[Figure: QorIQ™ P4 Platform DPAA. Labels: Network Interfaces; FMan; QMan; BMan; Parse; Classify; Steer; Policing; Congestion Mgmt; Stash; Context; Enqueue; Manage; Work Q; Cores; Accelerators]
14. Multicore Operating Systems
► Wide variation of customer use-cases
• Multiple operating systems utilized across cores on a single device
• Proprietary, 3rd party and Open Source multicore operating systems
• Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP), often running concurrently
• Often no OS, or engineered light OS, used on forwarding/data plane cores
► Leverage Power Architecture™ technology’s 3rd party OS ecosystem
• Freescale embedded Hypervisor
• Freescale boot standards, including u-boot
• Leverage open boot protocol and API standards (e.g. Power.org™)
• Freescale Light Weight Executive (LWE) for run to completion data plane processing
• Demonstrate performance and provide reference example for customers
[Figure: eight Power Architecture™ cores partitioned among Services, Forwarding/Data Plane and Control Plane. Labels: MontaVista Linux® (three instances); Light Weight Executive; AMP (two instances); SMP]
15. Light Weight Executive Summary
►The LWE provides a set of services and abstractions to an application
►Focus is on run-to-completion model
►Freescale provides example applications to demonstrate the use of the LWE
►The LWE helps Freescale customers and partners develop functionality using cores as highly optimized accelerators
[Figure: an Application on the Light Weight Executive, with interaction to software on other cores, e.g. running Linux®]
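The run-to-completion model named on this slide can be illustrated with a minimal sketch. This is not the LWE API, which the slides do not show; `run_to_completion`, `rx_queue` and `stages` are hypothetical names chosen for illustration. The idea is that one core takes a packet and carries it through every processing stage before it touches the next packet, with no blocking or hand-off in between.

```python
from collections import deque


def run_to_completion(rx_queue, stages):
    """Carry each packet through every stage before taking the next packet.

    rx_queue: deque of packets waiting on this core (hypothetical).
    stages:   list of callables; a stage returns the packet to pass on,
              or None to drop it.
    """
    completed = []
    while rx_queue:
        packet = rx_queue.popleft()
        for stage in stages:
            packet = stage(packet)
            if packet is None:  # dropped by this stage, stop processing it
                break
        else:
            completed.append(packet)  # every stage ran without a drop
    return completed


def add_header(packet):
    return packet + 1


def drop_odd(packet):
    return packet if packet % 2 == 0 else None
```

Because nothing in the loop waits on another core, per-packet latency depends only on the stages themselves, which is what makes the model attractive for data plane cores.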
16. Hypervisor Contrasts
Freescale Hypervisor Implementation
• Requirement: isolation, performance
• Implications: No more than one OS per core, OS has direct control of high-speed peripherals
Traditional Hypervisor Implementation
• Requirement: solves problem of under-utilized CPUs, plus isolation
• Implications: more than one OS per core, complexity, performance implications
QorIQ™ P4080 hypervisor hardware assists in meeting both requirement sets
[Figure: mapping of Guest OS instances to CPUs under each implementation]
17. Natural Virtualization via QorIQ™ P4080 Datapath
►Datapath decouples cores and peripherals – allows N cores to share M peripherals
►Accessed by “Portals” that are per-core
►Allows direct and efficient access by cores to many high-speed peripherals
Cores can access the same network interface with no SW synchronization because cores have their own portals
[Figure: two Power Architecture™ cores, each with its own portal into the P4080 Datapath, sharing one Network Interface]
18. Solution
Solution = Freescale software + ecosystem software + customer software
[Figure: software stacks above a Hypervisor on Freescale QorIQ™ Silicon. Labels: Partition Mgmt.; MontaVista; Linux; LWE; Applications; Example Apps; Stacks; High Level IPC; Drivers; IPC. Legend: Freescale; 3rd Party and/or Customer]
19. Market Analysis
“Developers overwhelmingly voted for the chip's software-development tools as the most important thing when evaluating a new embedded processor.”
“The most valuable feature of a chip isn't even the chip itself. Compilers and debuggers trump MIPS and megahertz.”
- Jim Turley, ESD
Source: Embedded Systems Design Survey
20. Migrating to Multicore: What is the pathway?
►Contain
►Exploit
►Analyze
►Optimize
21. Containment
Goal: Migrate application codebase to multicore platform without disruption
►Risk – concurrent execution will expose latent race conditions and synchronization issues
►Technique – utilize Linux's processor and interrupt affinity APIs to contain your application's threads and processes to a single core
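As a minimal sketch of the containment technique, the snippet below pins the calling process to one core using Python's wrapper around the Linux `sched_setaffinity` system call. The helper name `contain_to_core` is an assumption made for illustration; the slides do not show code. It only works on Linux, and the core chosen must be one the process is currently allowed to run on.

```python
import os


def contain_to_core(core, pid=0):
    """Restrict a process to a single CPU core (Linux only).

    pid 0 means the calling process. Children created afterwards
    inherit the mask, so calling this early in startup contains them too.
    Returns the affinity set reported back by the kernel.
    """
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)
```

An unmodified binary can be contained the same way from the shell with the util-linux `taskset` tool, for example `taskset -c 0 ./app`, where `./app` stands in for your application.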
25. Migration with Containment
The designer can explicitly control which CPUs are permitted to handle particular threads and interrupts
Shown on Freescale 8641D multicore processor
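For the interrupt side, Linux exposes a per-IRQ CPU mask at `/proc/irq/<irq>/smp_affinity`, written as a hexadecimal bitmask. The sketch below builds that mask and writes it; the function names and the `proc_root` parameter are hypothetical, the IRQ number must come from your own system (for example from `/proc/interrupts`), and writing the file requires root. It assumes a system with at most 32 CPUs, where the mask is a single hex word.

```python
def cpu_mask_hex(cpus):
    """Return the hex bitmask with one bit set per CPU number in `cpus`."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Steer one IRQ to the given CPUs by writing its smp_affinity mask.

    Needs root privileges. `proc_root` is a parameter only so the
    function can be pointed at a scratch directory when experimenting.
    """
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as affinity_file:
        affinity_file.write(cpu_mask_hex(cpus))
    return path
```

Keeping an interrupt on the same core as the thread that consumes its data is one way to hold a legacy application to a single core during containment.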
27. Why SMP?
►Multicore CPUs can permit a number of processing scenarios
►SMP maximizes run-time flexibility to match CPU to the needs of the moment
►SMP ends up playing a role in many system architectures
►Combined with a hypervisor, SMP does not exclude any other design options
28. Linux’s Long March to Multicore
►Linux has been multicore (MC) ready for years
►Kernel, drivers, protocol stacks, and apps are ready
►As core count scales the focus shifts to exploiting MC at the application layer
30. Sidebar Summary
►SMP is the natural way for Linux to exploit multicore processors.
►Hypervisors can permit new flexibilities
►New hardware features are making hypervisor based architectures more efficient to use
31. Migrating to Multicore: What is the Pathway?
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
►Analyze
►Optimize
32. Exploit
Goal: Identify code that will benefit from multicore execution and modify code to exploit available cores
33. Application Architectures to Exploit MC
Objective: scale efficiently across multiple cores so that more client work can be handled rapidly
►Key question is how to map client requests (or packets) to workers quickly and obtain speed-up from multicore
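One common answer to the mapping question is to hash a per-flow key so that every packet of the same flow lands on the same worker, which keeps per-flow state on one core. The sketch below is an illustration under that assumption, not a method taken from the slides; `worker_for` and the flow-key string format are hypothetical.

```python
import zlib


def worker_for(flow_key, num_workers):
    """Map a flow key (e.g. "src:port->dst:port") to a worker index.

    CRC32 is used because it is cheap and stable across runs, so the
    same flow always maps to the same worker.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return zlib.crc32(flow_key.encode("utf-8")) % num_workers
```

The trade-off is that a single very busy flow cannot spread across cores; designs that need that usually move state out of the worker instead.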
36. Application Characteristics
►Each request requires a small amount of work
►Requests are largely independent of each other
►Requires read-only access to a moderate amount of state
►Small amount of state may travel with the request
►Must be able to manage overload effectively
►Some anti-patterns
• Non-concurrent
• Process/Thread per client
• Spawn process/thread per request
• HPC message passing such as MPI
For telecom/datacom applications an event driven architecture is ideal to facilitate multicore migration
37. Sample Application Architecture
Similar to that used by memcached & Apache
► Dispatcher can handle overload, monitoring, etc.
► Multicore awareness only for central services
► Pluggable Dispatcher is feasible if planned correctly
► Managing global, per service, per session, and per request state is the battleground for scalability
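A minimal sketch of the dispatcher-and-workers shape described above, assuming a fixed worker pool fed through a bounded queue. The `Dispatcher` class and its parameters are hypothetical, not the architecture of memcached or Apache. Overload is handled at the dispatcher by rejecting work when the queue is full. Note that CPython threads share one interpreter lock, so a real multicore speed-up would need processes or native threads; the structure is the point here.

```python
import queue
import threading


class Dispatcher:
    """Hand requests to a fixed pool of workers through a bounded queue."""

    def __init__(self, handler, workers=4, max_pending=64):
        self._handler = handler
        self._pending = queue.Queue(maxsize=max_pending)
        self._results = queue.Queue()
        self.dropped = 0  # requests rejected because the queue was full
        self._threads = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def submit(self, request):
        """Queue a request; return False (and count it) on overload."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _work(self):
        # Each worker handles one request at a time until it sees None.
        while True:
            request = self._pending.get()
            if request is None:
                break
            self._results.put((request, self._handler(request)))

    def close(self):
        """Stop the workers and return all (request, result) pairs."""
        for _ in self._threads:
            self._pending.put(None)
        for thread in self._threads:
            thread.join()
        results = []
        while True:
            try:
                results.append(self._results.get_nowait())
            except queue.Empty:
                return results
```

Only the dispatcher knows how many workers exist, which matches the slide's point that multicore awareness can be limited to central services.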
38. Migrating to Multicore: What is the Pathway?
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
• Use an event driven architecture to add explicit functional parallelism
►Analyze
►Optimize
39. Analyze
Goal: Understand MC performance bottlenecks and diagnose unexpected faults
►Benchmark first... the bottlenecks may not be where you think they are
40. Analysis Tools
Profiling
• Can be used for far more than CPU cycles per function or line
• e500mc core has a rich set of performance attributes it can monitor
• MontaVista DevRocket can use oprofile to collect and correlate this data to your code
Runtime Monitoring
• “top” in SMP mode will give you a broad overview of CPU stats
Tracing
• Fine grained CPU-aware tracing
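The per-CPU view that “top” gives comes from kernel counters that Linux also exposes in `/proc/stat`. As a rough sketch of runtime monitoring, the function below parses that text and reports the busy fraction of each core. The name `per_cpu_busy` is an assumption for illustration. The counters are cumulative since boot, so a live view would sample twice and work on the difference.

```python
def per_cpu_busy(stat_text):
    """Return {cpu_name: busy_fraction} from the text of /proc/stat.

    Uses the first eight counters of each cpuN line (user, nice, system,
    idle, iowait, irq, softirq, steal). Idle and iowait count as not busy.
    The aggregate "cpu" line is skipped.
    """
    busy = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if not (name.startswith("cpu") and name[3:].isdigit()):
            continue
        ticks = [int(value) for value in fields[1:9]]
        total = sum(ticks)
        idle = sum(ticks[3:5])  # idle + iowait
        busy[name] = (total - idle) / total if total else 0.0
    return busy
```

On a live system the input would be read with `open("/proc/stat").read()`. One core near 1.0 while the others sit near 0.0 is a typical sign that the work has not been spread across cores yet.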
46. Migrating to Multicore: What is the Pathway?
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
• Use an event driven architecture to add explicit functional parallelism
►Analyze
• Use available profiling, tracing, and performance monitoring tools and APIs
►Optimize
47. Optimize
Goal: Get the most from the available MC performance
►Focus attention on areas where Amdahl's law indicates the most benefit can occur!
►Leverage data parallelization for CPU bound computations
►Utilize interrupt and process/thread affinity to tune the system
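Amdahl's law bounds the speed-up of a program in which a fraction p of the run time can be spread over n cores: speed-up = 1 / ((1 - p) + p / n). The short sketch below evaluates that bound; the function name is chosen for illustration, and p would in practice come from profiling your own application.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speed-up when `parallel_fraction` of the run time
    scales perfectly across `cores` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

For example, code that is 90% parallel is bounded at roughly 4.7x on eight cores, so shrinking the remaining serial 10% can matter more than adding cores.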
48. Migrating to Multicore: What is the Pathway?
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
• Use an event driven architecture to add explicit functional parallelism
►Analyze
• Use available profiling, tracing, and performance monitoring tools and APIs
►Optimize
• Specialize cores as needed. Explore other MC optimizations
49. MontaVista Support for Freescale Multicore
[Table: Freescale processors supported (8572; 8641D, 8640D) by MontaVista edition: Carrier Grade Edition 4.0; Carrier Grade Edition 5.0; Professional Edition 4.0; Professional Edition 5.0]
Freescale P4080 operating today on the Virtutech Simics simulator in advance of hardware availability
MontaVista offers comprehensive support of Freescale Power Architecture processors today
50. Two Ways to Learn More About Multicore
MontaVista Vision
October 1-3, 2008 San Francisco, CA
Where embedded Linux gets real
For more information on in-depth multicore sessions, visit: www.mvista.com/vision
MontaVista TestDrive
Evaluate Freescale multicore and MontaVista Linux for free, visit: www.mvista.com/freescale/eval