Keynote presentation, How Many Cores Will We Need?, by Dr Chien-Ping Lu, Sr Director – Corporate Technology Office, MediaTek USA Inc., at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
1. HOW MANY CORES WILL WE NEED?
IN SEARCH OF PARALLEL KILLER APPS
CHIEN-PING LU, PHD
MEDIATEK INC
2. A GROUP OF HIPPOS IS CALLED …
A Crash
2 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL
3. A GROUP OF CROWS IS CALLED …
A Murder
4. A GROUP OF GIRAFFES IS CALLED …
From Wikipedia
A Tower
5. SO, IT IS NOT SURPRISING THAT WE USE
“A Parade” of elephants
“A Herd” of sheep
“An Army” of ants
6. FROM FREQUENCY TO MULTICORE SCALING
[Figure: performance and power over time. Frequency scaling on a single core gives way to multi-core scaling at the power wall (2005)]
7. IT SEEMS INEVITABLE THAT WE WILL NEED A MASSIVE NUMBER OF CORES
[Figure: performance over time as the number of cores grows from moderate to massive]
8. DARK SILICON (OR DARK CORES)?
[Figure: performance over time. Data labels: 2x; 4x, 3x; 8x, 4x; 16x, 4x]
9. HOW TO LIGHT UP THE CORES?
Redefine the cores to be heterogeneous
Search for parallel killer apps
[Figure: power versus degree of parallelism, bounded by a power ceiling and a parallelism wall. Plotted: Big cores, Little cores, SIMT “cores”. Example workloads: H.264 encoding, Ray tracing]
10. ARMY OF ANTS: SIMT CORES
FOR SIMT (SINGLE-INSTRUCTION-MULTIPLE-THREAD ) EXECUTION
SIMT is the execution model of HSA and is implemented in modern GPUs, with MIMD flexibility and SIMD efficiency
A SIMT core runs 1 iteration of the parallel loop:
Parallel.For (…)
    …
    If (…) then
        …
    Else
        …
A cluster of SIMT cores shares one front end in a SIMD manner
A branch is emulated through divergence
[Figure: clusters of SIMT cores (ALUs), each cluster behind a shared Front End, next to SPEs (Specialized Processing Engines); a second panel is labeled Wider SIMT]
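The idea that a branch is emulated through divergence can be sketched in Python. This is only an illustration of the concept, not how any particular GPU or HSA implementation works: the function name `simt_if_else`, the lane values, and the two branch bodies are hypothetical. One shared front end steps all lanes through both sides of the branch, and a per-lane mask decides which lanes commit a result on each pass.

```python
def simt_if_else(lanes, cond, then_op, else_op):
    """Emulate an if/else across SIMT lanes that share one front end.

    Hypothetical sketch: the shared front end issues the 'then' path and
    the 'else' path one after the other; a per-lane mask selects which
    lanes are active (commit a result) on each pass.
    """
    mask = [cond(x) for x in lanes]          # one predicate bit per lane
    out = list(lanes)
    passes = 0

    # Pass 1: 'then' path, only lanes whose mask bit is True are active.
    if any(mask):
        passes += 1
        for i, x in enumerate(lanes):
            if mask[i]:
                out[i] = then_op(x)

    # Pass 2: 'else' path, only lanes whose mask bit is False are active.
    if not all(mask):
        passes += 1
        for i, x in enumerate(lanes):
            if not mask[i]:
                out[i] = else_op(x)

    return out, passes


# Divergent case: the lanes disagree, so both paths are issued.
result, passes = simt_if_else(
    [1, 2, 3, 4],
    cond=lambda x: x % 2 == 0,
    then_op=lambda x: x * 10,
    else_op=lambda x: -x,
)
```

When every lane takes the same side, only one pass is issued, which is where the SIMD efficiency comes from; when lanes disagree, the cluster pays for both paths.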
11. MASSIVELY PARALLEL WORKLOADS
• Problem size N can keep growing
• Visible serial workload s can be kept constant
• Parallel workload is sped up by a factor of P, the number of cores
• Reduction overhead is proportional to log P (by a factor of r)
• "Embarrassingly" parallel, when there is no reduction overhead (r=0)
[Figure: time on one core is s + N; time on P cores is s + r log P + N/P; the difference is the time saved by P cores]
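The workload model on this slide can be written down as a small Python sketch. The function names and the example numbers are illustrative, and the base-2 logarithm is an assumption (it matches a tree-shaped reduction); the slide only says the overhead is proportional to log P.

```python
import math


def parallel_time(n, p, s, r):
    """Time on p cores: serial part + reduction overhead + parallel part.

    Assumes a base-2 logarithm for the reduction term; the slide only
    states that the overhead is proportional to log P.
    """
    return s + r * math.log2(p) + n / p


def time_saved(n, p, s, r):
    """Time saved by p cores relative to one core, which needs s + n."""
    return (s + n) - parallel_time(n, p, s, r)


# Illustrative values: N = 256 units of parallel work, s = 0.5, r = 0.5.
t1 = parallel_time(256, 1, 0.5, 0.5)     # one core
t16 = parallel_time(256, 16, 0.5, 0.5)   # sixteen cores
saved = time_saved(256, 16, 0.5, 0.5)
```

Setting r to 0 gives the "embarrassingly" parallel case, where only the constant serial part s limits the gain.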
12. REVISITING AMDAHL'S LAW
s=50%, r=50%
$\text{Speedup} = \dfrac{s + N}{s + r \log P + N/P}$
With P = N: $\text{Speedup} = \dfrac{s + P}{s + r \log P + 1}$
[Figure: two log-scale plots of Speedup versus Degree of Parallelism (P), with curves for N=16, N=64, and N=256, plus a P=N curve]
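A short Python sketch can reproduce the shape of these curves. It assumes the speedup is the one-core time s + N divided by the P-core time s + r log P + N/P, reads "s=50%, r=50%" as s = r = 0.5, and uses a base-2 logarithm; all three are assumptions, and the helper names are hypothetical.

```python
import math


def speedup(n, p, s=0.5, r=0.5):
    """Speedup of p cores over one core for problem size n.

    One-core time is s + n; p-core time is s + r*log2(p) + n/p.
    The base-2 logarithm and the default s and r are assumptions.
    """
    return (s + n) / (s + r * math.log2(p) + n / p)


def sweep(n, max_p=1024, s=0.5, r=0.5):
    """Speedup at each power-of-two core count from 1 to max_p."""
    points = []
    p = 1
    while p <= max_p:
        points.append((p, speedup(n, p, s, r)))
        p *= 2
    return points


# Report where each fixed-N curve peaks.
for n in (16, 64, 256):
    best_p, best = max(sweep(n), key=lambda point: point[1])
    print(f"N={n}: peak speedup {best:.2f} at P={best_p}")
```

Under these assumptions, a fixed-N curve flattens and then falls once the r log P term outgrows the shrinking N/P term, whereas letting N grow with P keeps the speedup climbing.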
13. GRAPHICS KEEP MOVING
Pac-man, 1980
Highest grossing video game of all-time
Recognized by 94% of American Consumers
Mobile 3D Graphics
GL benchmark 2.1 Egypt
GL benchmark 2.5 Egypt
GFX bench 2.7 T-Rex
GFX bench 3.0 Manhattan
14. MEDIATEK FACE BEAUTIFICATION
WHEN IT COMES TO BEAUTY, THERE SEEMS TO BE NO LIMIT
Before
Skin tone adjustment
Wrinkle removal
Thinner face, bigger eyes
15. HIGH-PERFORMANCE COMPUTING (HPC) KEEPS SCALING OUT
More atoms
Higher grid resolution
More time steps
HPC from 1993 to 2012
‒GFLOPS ~ 130,000x
‒Cores ~ 11,000x
‒GHz ~ 10x
[Figure: Top of Top500 1993-2012. GFLOPS, Cores, and GHz relative to 1993, log scale, years 1990 to 2015]
16. THE MISSING LINKS
IN SEARCH OF PARALLEL KILLER APPS
Moore’s law
Higher frequency
More cores
More complex software
Better user experience
Bigger data
Mining bigger data with Machine Learning
Bigger problems
What bigger problems to solve with bigger data?
How does solving bigger problems lead to better user experience?
17. MACHINE LEARNING: TREND PREDICTION WITH POWERFUL MODELS
Powerful models (with many knobs) tend to overfit the noise if the data set is not sufficiently large
The explosive growth of data has made powerful models feasible
A model with 1 billion knobs, trained with 10 million images from YouTube, was used in the Google Brain experiment to figure out the concepts of cats and human faces by itself
Source: Le et al., Building High-level Features Using Large Scale Unsupervised Learning
[Figure: Data versus Samples with Linear, Poly. (2nd order), and Poly. (6th order) fits. The 6th-order polynomial undulates excessively with only 4 samples]
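The overfitting point can be illustrated with a few lines of Python. This is a sketch on made-up data, not the slide's chart: a cubic stands in for the 6th-order polynomial because four samples determine at most a cubic uniquely, and the data values, the "true" trend, and the function names are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns a function x -> prediction."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every sample (degree len(xs) - 1)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


# Made-up data: true trend y = 10x with alternating noise of +5 and -5.
xs = [0, 2, 4, 6]
ys = [5, 15, 45, 55]


def truth(x):
    return 10 * x


line = fit_line(xs, ys)          # 2 knobs: smooths over the noise
cubic = fit_interpolant(xs, ys)  # 4 knobs: reproduces the noise exactly

# Compare both models at a point that was not in the training samples.
x_new = 8
line_error = abs(line(x_new) - truth(x_new))
cubic_error = abs(cubic(x_new) - truth(x_new))
```

The cubic has zero error on the four training samples yet misses the trend badly at the new point, while the line, with fewer knobs, stays close. More samples are what make the higher-order model safe to use.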
18. HOW TO DISTINGUISH CATS FROM DOGS?
ASIRRA
Animal Species Image Recognition for Restricting Access (from Microsoft Research)
19. CAN ASIRRA BE CRACKED?
20. WHY IS IT HARD?
Source: training set of Kaggle.com Dogs vs. Cats competition
21. IS THERE A MODEL FINDING OUT THAT THESE ARE THE SAME DOG?
Prancer, a 5-year-old toy poodle, before and after grooming
22. MINE THE SOLUTIONS FROM THE DATA
Theory of the differences between dogs and cats?
Learn from many (12,500) photos labeled as dogs or cats
Machine Learning
Dog-Cat classifier
23. SMART AND SMARTER CLIENTS IN THE ERA OF BIG DATA
[Diagram: Cloud: Big Data, Big Training Set, Machine Learning, Powerful Model. Client: Sensing, Input data, Connectivity, Answer]
[Overlay: Bigger Data, Bigger Training Set, Bigger Machine Learning (in the cloud or the clients), Bigger Model. Smarter Client: Better Sensing, Better Connectivity, Better Answer, Local Machine Learning]
24. PARALLEL COMPUTING IN THE CLOUD AND AT THE CLIENTS
Model: $y = f(x; a_i)$
Examples:
x: dog/cat photos, y: dog or cat
x: sensor readings, y: jogging, walking or driving
Machine Learning: given samples $(x_n, y_n)$, tweak the knobs $a_i$ to minimize the error between $f(x_n; a_i)$ and $y_n$
Cloud Parallel Computing with more samples
Client Parallel Computing with more knobs
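"Tweaking the knobs to minimize the error" can be sketched as plain gradient descent in Python. Everything concrete here is an assumption made for illustration: the two-knob linear model, the mean-squared-error measure, the learning rate, and the sample values. The slide does not specify a model or a training method.

```python
def f(x, a):
    """Hypothetical model with two knobs: f(x; a) = a[0] + a[1] * x."""
    return a[0] + a[1] * x


def train(samples, a, rate=0.05, steps=2000):
    """Tweak the knobs by gradient descent on the mean squared error."""
    n = len(samples)
    for _ in range(steps):
        # Each sample's error term is independent of the others, so these
        # sums are the part that parallelizes across samples.
        g0 = sum(2 * (f(x, a) - y) for x, y in samples) / n
        g1 = sum(2 * (f(x, a) - y) * x for x, y in samples) / n
        a = [a[0] - rate * g0, a[1] - rate * g1]
    return a


# Made-up samples (x_n, y_n) drawn from y = 1 + 2x with no noise.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
knobs = train(samples, a=[0.0, 0.0])
```

With more samples, the per-sample sums grow and can be split across many cores; with more knobs, the per-knob gradient and model evaluation grow instead. That matches the slide's split between parallelism over samples and parallelism over knobs.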
25. WHY HSA?
Machine learning happens in the cloud and at the clients
Models run in the cloud or at the clients
Need same ease of programming and write-once-run-everywhere for heterogeneous cores
MediaTek is one of the cofounders of the HSA Foundation
MediaTek is the first to introduce in mobile SoC:
True Octa-Core
Heterogeneous Multiprocessing (HMP)
26. SCALE OUT AND SCALE IN WITH HETEROGENEOUS CORES
• Both the cloud and mobile clients are limited by power
• Mobile devices need to keep cool in our palms
• Data centers need to keep our environment clean
• Carbon footprint of US datacenters is at the same level as the airline industry
• A 1,000 m² datacenter consumes 1.5 MW, enough to power 1,000 US homes
In order to scale out, we need to scale in with heterogeneous cores in the cloud and in our palms
[Image: Typical 1,000 homes in US]
28. THE NEW VIRTUOUS CYCLE
PERHAPS, LEADING TO COMPUTING LIKE OUR BRAIN
Moore’s law and beyond
More heterogeneous cores
Bigger data
Mining bigger data with Machine Learning
Better user experience
29. MASSIVELY PARALLEL WORKLOADS
• Can keep growing the problem size N
• The serial workload s can be kept constant
• The parallel workload is sped up by a factor of P, the number of cores
• The reduction overhead is proportional to log P (by a factor of r)
• "Embarrassingly" parallel, when there is no reduction overhead (r=0)
[Figure: time on one core is s + N; time on P cores is s + r log P + N/P; the difference is the time saved by P cores]
30. THE ELEPHANTS: CPU CORES
FOR MULTIPLE-INSTRUCTION-MULTIPLE-DATA (MIMD) EXECUTION
Retrofitted for moderately parallel workloads, and not very efficient for massively parallel workloads
A CPU core runs 1 iteration of the parallel loop:
Parallel.For (i)
    …
    If (…)
        …
    Else
        …
[Figure: an array of CPU cores, each drawn as a Front End with an ALU. The same color means the same piece of code]
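The MIMD picture, where each core runs one iteration with its own control flow, can be sketched with Python's standard thread pool. The loop body and the helper name `parallel_for` are hypothetical. The pool only illustrates the programming model: in standard CPython, the global interpreter lock keeps threads from executing Python bytecode truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def body(i):
    """One iteration of the parallel loop, with its own branch."""
    if i % 2 == 0:
        return i * i        # 'If' path
    return -i               # 'Else' path


def parallel_for(n, workers=4):
    """Run body(0), ..., body(n - 1) on a pool of independent workers.

    Each worker follows its own control flow, so iterations that take
    different branches do not wait on each other the way lanes behind a
    shared front end would.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(body, range(n)))


results = parallel_for(8)
```

The price of this flexibility is one front end per core, which is why the slide calls CPU cores not very efficient for massively parallel workloads.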