1. A Breakthrough New CPU Architecture Revives IPC Scaling
Mohammad Abdallah, Founder, President and CTO
Linley Processor Conference
October 23, 2014
2. Introducing Soft Machines™
• Emerging from stealth mode
• Developed new VISC™ Architecture
• 7 years, $125M R&D
• ~250 employees, 75+ patents filed
©Copyright 2014, All Rights Reserved
3. The Death of CPU Scaling
“The failure of CPU scaling after 30 years of continual improvements may have slammed the door on the easiest and most common type of performance scaling…”
“The Death of CPU Scaling”, ExtremeTech (2012)
2014
Microprocessor Scaling Realities after 2004
Transistor scaling continues
Clock speed flat
Power budget flat
Perf/clock flat
Source: “The Free Lunch is Over”, Herb Sutter
4. Industry Response: Multi-Core
Figure labels: Core1, Core2; Thread1, Thread2
Advantages:
- Utilizes growing transistor budget
- Performance scaling for parallel code
- Improves throughput
Challenges:
- ST performance doesn’t scale
- Threading/multicore coding complexity
- Amdahl’s Law of diminishing returns
- Dark silicon
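The diminishing returns named in the challenges list can be illustrated with Amdahl's Law. The following is a minimal sketch, not taken from the slides: the function name `amdahl_speedup` and the 50% parallel fraction in the demo loop are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not accelerated; only the parallel part is divided across cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: half of the run time is parallelizable.
    for cores in (1, 2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.5, cores), 2))
```

With a 50% parallel fraction the speedup stays below 2x no matter how many cores are added, which is why adding cores alone does not help single-thread-bound code.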
5. CPU Architecture Challenge
• Revive CPU performance scaling
• Utilize Moore’s Law transistor scaling
• Mitigate dark silicon
• Liberate ISA dependency
6. VISC™ Architecture Wave
Figure: three architecture waves
CISC (IBM/Intel): 1970s – early 1980s
RISC (MIPS): Late 1980s – 2010s
VISC (Soft Machines): 2010s
Software Scalability/Productivity labels: Assembly, Compilation, Concurrency Extraction
Device Physics Scalability labels: Code Memory size, Short Pipeline, Deep OoO Pipeline, Processor Speed, Virtual Cores/Threads, Processor Power
VISC Architecture scales on both physical and software productivity layers
7. VISC™ Processor Block Diagram
Figure labels: Sequential Code (SW Single Thread); Global Front End; Virtual HW Threads (HW threadlets); Virtual Cores: Virtual Core1 to Virtual Core4; Core1 to Core4, each with L1 D$; L2$ & Memory
8. VISC™ CPU Usage Example
• VISC dynamically allocates resources across virtual cores based on individual application needs
• Performance/watt balanced for both single & multi-thread applications
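As a rough illustration of demand-based allocation, the sketch below splits a pool of physical core resources among software threads in proportion to a demand weight. This is a toy model only: the slides do not describe the actual allocation policy, and the function `allocate_cores`, the demand weights, and the proportional rule are all hypothetical.

```python
from typing import Dict


def allocate_cores(demands: Dict[str, float], physical_cores: int) -> Dict[str, float]:
    """Split physical core resources among threads in proportion to their demand.

    Hypothetical policy for illustration; not the VISC hardware algorithm.
    """
    total = sum(demands.values())
    if total <= 0:
        raise ValueError("at least one thread must have positive demand")
    return {name: physical_cores * demand / total for name, demand in demands.items()}


if __name__ == "__main__":
    # Dual software threads: a heavy app and a light app share two physical cores.
    print(allocate_cores({"heavy": 3.0, "light": 1.0}, physical_cores=2))
    # Single software thread: one heavy app receives both physical cores.
    print(allocate_cores({"heavy": 1.0}, physical_cores=2))
```

The point of the sketch is only that a single heavy thread can be given more than one physical core's worth of resources, while a light thread receives less than one.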
Figure: two usage examples on two physical cores (Core1 and Core2, each with L1 D$), joined by "or". Labels: Dual SW Threads (Heavy App, Light App) with Virtual Core1 and Virtual Core2; Single SW Thread (Heavy App) with Virtual Core1; Virtual Cores; Virtual HW Threads/Threadlets
9. VISC™ Architecture Prototype Pipeline
Figure: Pipeline of Virtual Threads Across the Virtual Cores
Pipeline labels: Fetch; Allocate/Dispatch; EXE; Mem/long latency Execution; RF read; Virtual Thread Formation
Block labels: SW Single Thread; Global Front End; Virtual HW Threads (HW threadlets); Virtual Cores: Virtual Core1, Virtual Core2; Core1, Core2, each with L1 D$; L2$ & Memory
10. VISC™ Revives IPC Curve
CPU | ARM A15 1C | Intel Atom 1C | Soft Machines 2VC Proto | Apple A7 1C | ARM A57 1C | Intel Haswell 1C
Compiled Code | 32-bit | 32-bit | 32-bit | 32-bit | 32-bit | 64-bit
Cache | 1M | 2M | 1M | 1M+4M | 2M | 2M
Pipeline | Moderate | Moderate | Shallow | Moderate | Moderate | Deep
IPC (SPEC 2006)* | 0.71 | 0.69 | 2.1 | 1.0 | 0.87 | 1.39
* Company conducted benchmark tests and projections, using industry-standard Compiler GCC 4.6 or equivalent
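The IPC figures on this slide can be turned into ratios against the Soft Machines prototype with a few lines of arithmetic. The sketch below assumes the six IPC values map to the six CPUs in the order they are listed on the slide; the dictionary keys and the helper `ipc_ratios` are illustrative names, not part of the presentation.

```python
from typing import Dict

# IPC (SPEC 2006) values as listed on the slide, paired with CPUs in listed order.
IPC: Dict[str, float] = {
    "ARM A15 (1C)": 0.71,
    "Intel Atom (1C)": 0.69,
    "Soft Machines Proto (2VC)": 2.1,
    "Apple A7 (1C)": 1.0,
    "ARM A57 (1C)": 0.87,
    "Intel Haswell (1C)": 1.39,
}


def ipc_ratios(ipc: Dict[str, float], reference: str) -> Dict[str, float]:
    """Return reference IPC divided by each CPU's IPC."""
    ref = ipc[reference]
    return {name: ref / value for name, value in ipc.items()}


if __name__ == "__main__":
    for name, ratio in ipc_ratios(IPC, "Soft Machines Proto (2VC)").items():
        print(f"{name}: {ratio:.2f}x")
```

These ratios compare company-reported and projected numbers, so they inherit the caveats in the slide's footnote.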
Mobile CPU designs are pursuing higher ARCH/μARCH complexity
2006: The Basic, A8, 2-way
2009: The Simple, A9, 2-way OoO
2011: The Moderate, A15, 3-way
2013: The Big, Apple A7, 6-way
2014: The Ultimate, Haswell, 8-way
11. VISC™ Concurrency Extraction: Linear vs. Quadratic Complexity
• Extracting ILP has significant complexity
• OoO complexity increases quadratically with machine width
• VISC complexity increases linearly with number of virtual cores
• VISC Performance/Watt utilizes linear scaling
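The linear versus quadratic claim can be made concrete with a toy cost model. The sketch below is an assumption-laden illustration: it takes the cost of a monolithic out-of-order core as the square of its width and the cost of a multi-core design as the number of cores times the cost of one narrow core. The unit constants, the function names, and the omission of any shared front-end cost are all simplifications not found in the slides.

```python
def ooo_cost(width: int) -> int:
    """Toy cost of one monolithic OoO core: grows with the square of machine width."""
    return width ** 2


def visc_cost(virtual_cores: int, core_width: int) -> int:
    """Toy cost of several narrow cores: grows linearly with the number of cores."""
    return virtual_cores * ooo_cost(core_width)


if __name__ == "__main__":
    # Same aggregate width of 8, built two ways.
    print("monolithic 8-wide:", ooo_cost(8))
    print("4 cores x 2-wide:", visc_cost(4, 2))
```

Doubling the number of narrow cores doubles the toy cost, while doubling the width of a single core quadruples it.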
12. System Energy Approach: DRVFS
Virtual Cores – DRVFS
• DRVFS: linear increase in power
• $P \propto$ No. of virtual core resources
• Higher Perf/MHz enables DVFS scaling DOWN
Physical Cores – DVFS
• DVFS: quadratic increase in power
• $P \propto V^2 \cdot F$
• Lower Perf/MHz requires DVFS scaling UP
Use Case: Rush to low power mode (boosting performance or response time)
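The two power relations on this slide can be compared numerically. The sketch below works in relative units only (power at the baseline operating point is 1.0) and the function names are illustrative. The 20% voltage and frequency increase in the demo is a hypothetical operating point, not a figure from the presentation.

```python
def dvfs_relative_power(voltage_scale: float, frequency_scale: float) -> float:
    """Relative dynamic power under DVFS, using P proportional to V^2 * F."""
    return voltage_scale ** 2 * frequency_scale


def drvfs_relative_power(resource_scale: float) -> float:
    """Relative power when scaling virtual core resources, using P proportional to resources."""
    return resource_scale


if __name__ == "__main__":
    # Hypothetical boost: raise both voltage and frequency by 20%.
    print("DVFS, V and F +20%:", dvfs_relative_power(1.2, 1.2))
    # Hypothetical boost: double the resources assigned to a virtual core.
    print("DRVFS, 2x resources:", drvfs_relative_power(2.0))
```

The comparison says nothing about how much performance each option buys; it only shows how the power side of each relation grows.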
13. VISC™ Single Thread SPEC/Watt
Same performance at 1/4 to 1/3 of the power, or 1.7-2.2x performance at the same power*
* Company conducted benchmark tests and projections for 28nm
Figure: Single Thread Performance vs. Power for Mobile and Server, comparing a 1C App CPU with 1VC (2C) and 1VC (4C). Chart labels: 1.7x, 1.8x, 2.1x, 2.2x; 1/3, 1/4
14. VISC™ Dual Thread SPEC/Watt
Same performance at 0.4 to 0.5x of the power, or 1.4-1.9x performance at the same power*
* Company conducted benchmark tests and projections for 28nm
Figure: Dual Thread Performance vs. Power for Mobile and Server, comparing a 2C App CPU with 2VC (2C) and 2VC (4C). Chart labels: 1.4x, 1.5x, 1.8x, 1.9x; 1/2, 0.4x
15. VISC™ Technology Prototype
Working Silicon
• VISC Processor Proof-of-Concept Prototype
  • IPC scalability
  • VISC architecture
  • Software efficiency
• Full Platform
  • VISC Dual Virtual Core Processor
  • SoC with 3D, Video, DRAM controller, HD video….
• Full System functionality
  • Linux OS
  • UEFI BIOS
  • Benchmarks running on Linux
  • Android ICS booting
16. Silicon Results: Performance/MHz
Figure: Dual Virtual Core/A15 IPC Ratio
17. VISC™ Architecture
Figure labels: Virtual SW layer; Guest Sequential Code (Single Thread); OS & Hypervisor; Guest ISA; Virtual ISA; Global Front End; Virtual HW Threads/Threadlets; Virtual Cores: Virtual Core1 to Virtual Core4; Core1 to Core4, each with L1 D$; L2$ & Memory
18. VISC™ Run-time SW Architecture
Figure labels: Guest Code (ARM, X86); High level Virtual Machine; Low level Virtual Machine; Converter; Guest/VM to native mapping; Dynamic optimization; Hot Pass; Native Code; SMI API; VISC™ Processor
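The labels on this slide (guest code, converter, native code, dynamic optimization, hot pass) suggest a run-time translation flow. The sketch below shows the general idea of translating guest blocks on first use and re-optimizing blocks that become hot. It is a generic illustration, not the Soft Machines implementation: the class `ToyConverter`, its methods, the string-based "translation", and the threshold `HOT_THRESHOLD` are all hypothetical.

```python
from typing import Dict

HOT_THRESHOLD = 3  # Hypothetical value; the slides give no threshold.


class ToyConverter:
    """Toy guest-to-native converter with a translation cache and a hot pass."""

    def __init__(self) -> None:
        self.native_cache: Dict[int, str] = {}  # guest address -> native block
        self.run_counts: Dict[int, int] = {}    # guest address -> times executed

    def translate(self, guest_block: str) -> str:
        # Stand-in for the guest-to-native mapping step.
        return f"native({guest_block})"

    def optimize(self, native_block: str) -> str:
        # Stand-in for a dynamic optimization pass over a hot block.
        return f"optimized({native_block})"

    def run(self, address: int, guest_block: str) -> str:
        # Translate only on first use; later runs reuse the cached native block.
        if address not in self.native_cache:
            self.native_cache[address] = self.translate(guest_block)
        self.run_counts[address] = self.run_counts.get(address, 0) + 1
        # Once a block has run often enough, optimize it exactly once.
        if self.run_counts[address] == HOT_THRESHOLD:
            self.native_cache[address] = self.optimize(self.native_cache[address])
        return self.native_cache[address]


if __name__ == "__main__":
    converter = ToyConverter()
    for _ in range(4):
        print(converter.run(0x1000, "add r0, r1"))
```

Caching translations keeps the conversion cost off the common path, and deferring optimization until a block is hot limits the effort spent on code that runs only once.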
19. Summary
• Silicon proven VISC™ architecture delivers 3-4x IPC advantage on single and multi-threaded applications without software changes
  • Resulting in ~2-4x performance/watt advantage
• VISC architecture is scalable from IoT to mobile to servers due to its modularity and symmetry
  • Number of virtual cores, virtual threads, and virtual instruction layer
• VISC virtual instruction layer provides ISA agnostic and optimized run-time platform capabilities