Keynote (Dr Chien-Ping Lu) - How Many Cores Will We Need? - by Dr Chien-Ping Lu, Sr Director – Corporate Technology Office, MediaTek USA Inc.

HOW MANY CORES WILL WE NEED?
IN SEARCH OF PARALLEL KILLER APPS
CHIEN-PING LU, PHD
MEDIATEK INC

A GROUP OF HIPPOS IS CALLED …

A Crash
2 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL

A GROUP OF CROWS IS CALLED …

A Murder

A GROUP OF GIRAFFES IS CALLED …

From Wikipedia

A Tower

SO, IT IS NOT SURPRISING THAT WE USE

“A Parade” of elephants


“A Herd” of sheep

“An Army” of ants

FROM FREQUENCY TO MULTICORE SCALING

[Figure: performance versus time, annotated with frequency and power. Single-core scaling before, and multi-core scaling after, the power wall (2005).]

IT SEEMS INEVITABLE THAT WE WILL NEED A MASSIVE NUMBER OF CORES

[Figure: performance versus time, moving from a moderate to a massive number of cores.]

DARK SILICON (OR DARK CORES)?

[Figure: performance versus time. Data labels: 2x; 4x, 3x; 8x, 4x; 16x, 4x.]

HOW TO LIGHT UP THE CORES?
Redefine the cores to be heterogeneous

Search for parallel killer apps

[Figure: power versus degree of parallelism, bounded by a power ceiling and a parallelism wall. Core types: big cores, little cores, SIMT "cores". Example workloads: H.264 encoding, ray tracing.]

ARMY OF ANTS: SIMT CORES
FOR SIMT (SINGLE-INSTRUCTION-MULTIPLE-THREAD) EXECUTION

SIMT is the execution model of HSA and is implemented in modern GPUs, with MIMD flexibility and SIMD efficiency

A SIMT core runs 1 iteration of the parallel loop

Parallel.For (…)
  …
  If (…) then
    …
  Else
    …

A cluster of SIMT cores shares one front end in a SIMD manner
Specialized Processing Engines (SPE)
Wider SIMT
A branch is emulated through divergence

[Figure: front ends, each shared by a cluster of ALUs, alongside SPEs.]
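The statement that a branch is emulated through divergence can be illustrated with a small simulation. This is only a sketch under assumptions: `simt_branch`, the lane values and the two branch bodies are hypothetical names chosen for illustration, and real GPU hardware handles masking in its own way. The shared front end issues each side of the branch once, and a per-lane mask decides which lanes commit a result.

```python
from typing import Callable, List, Tuple


def simt_branch(lanes: List[int],
                cond: Callable[[int], bool],
                then_op: Callable[[int], int],
                else_op: Callable[[int], int]) -> Tuple[List[int], int]:
    """Run an if/else over all lanes of one cluster that shares a front end.

    Returns the per-lane results and the number of branch bodies issued.
    """
    mask = [cond(x) for x in lanes]   # per-lane predicate
    out = list(lanes)
    issued = 0

    # Pass 1: the then-body is issued once; only lanes with mask=True commit.
    if any(mask):
        issued += 1
        out = [then_op(x) if m else y for x, m, y in zip(lanes, mask, out)]

    # Pass 2: the else-body is issued once; only lanes with mask=False commit.
    if not all(mask):
        issued += 1
        out = [y if m else else_op(x) for x, m, y in zip(lanes, mask, out)]

    return out, issued


def is_even(x: int) -> bool:
    return x % 2 == 0


# Divergent case: lanes disagree, so both bodies are issued.
divergent, divergent_issued = simt_branch(
    [1, 2, 3, 4], is_even, lambda x: x * 10, lambda x: -x)

# Uniform case: all lanes agree, so only one body is issued.
uniform, uniform_issued = simt_branch(
    [2, 4, 6, 8], is_even, lambda x: x * 10, lambda x: -x)

print(divergent, divergent_issued)
print(uniform, uniform_issued)
```

When the lanes disagree, the cluster pays for both sides of the branch; when they agree, it pays for one.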

MASSIVELY PARALLEL WORKLOADS
• Problem size N can keep growing

• Visible serial workload s can be kept constant
• Parallel workload is sped up by P, the number of cores
• Reduction overhead is proportional to log P (by a factor of r)

• "Embarrassingly" parallel, when there is no reduction overhead (r=0)

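[Figure: execution time on one core (s + N) compared with execution time on P cores (s + r log P + N/P); the difference is the time saved by P cores.]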


REVISITING AMDAHL'S LAW

s = 50%, r = 50%

Fixed N: $\text{Speedup} = \dfrac{s + N}{s + r \log P + N/P}$

P = N: $\text{Speedup} = \dfrac{s + P}{s + r \log P + 1}$

[Figure: two log-log plots of speedup versus degree of parallelism (P), with curves for N=16, N=64, N=256 and P=N.]
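The speedup model on this slide can be tried out numerically. The sketch below reads the model as Speedup = (s + N) / (s + r log P + N/P), built from the workload terms s, N, r log P and N/P defined earlier in the deck. The base-2 logarithm and the function name `speedup` are assumptions; the slides do not state the base.

```python
import math


def speedup(n: float, p: int, s: float = 0.5, r: float = 0.5) -> float:
    """Speedup of P cores over 1 core for problem size N.

    s: serial workload, r: reduction overhead factor (both 50% as on the slide).
    Assumes log base 2, as for a reduction tree of depth log2(P).
    """
    time_one_core = s + n
    time_p_cores = s + r * math.log2(p) + n / p
    return time_one_core / time_p_cores


for n in (16, 64, 256):
    row = [round(speedup(n, p), 2) for p in (1, 4, 16, 64, 256)]
    print(f"N={n:>3}: {row}")

# Growing the problem with the core count (P = N).
print("P=N :", [round(speedup(p, p), 2) for p in (1, 4, 16, 64, 256)])
```

Under these assumptions, a fixed N stops benefiting once the r log P term outgrows the shrinking N/P term, while letting N grow with P keeps the speedup rising.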

GRAPHICS KEEP MOVING

Pac-Man, 1980
• Highest grossing video game of all time
• Recognized by 94% of American consumers

Mobile 3D Graphics
• GLBenchmark 2.1 Egypt
• GLBenchmark 2.5 Egypt
• GFXBench 2.7 T-Rex
• GFXBench 3.0 Manhattan

MEDIATEK FACE BEAUTIFICATION
WHEN IT COMES TO BEAUTY, THERE SEEMS TO BE NO LIMIT

Before


Skin tone adjustment
Wrinkle removal

Thinner face, bigger eyes

HIGH-PERFORMANCE COMPUTING (HPC) KEEPS SCALING OUT

More atoms
Higher grid resolution
More time steps

• HPC from 1993 to 2012
‒ GFLOPS ~ 130,000x
‒ Cores ~ 11,000x
‒ GHz ~ 10x

[Figure: Top of Top500, 1993-2012. GFLOPS, cores and GHz relative to 1993 on a log scale, plotted against year (1990 to 2015).]

THE MISSING LINKS
IN SEARCH OF PARALLEL KILLER APPS

[Diagram nodes: Moore’s law; higher frequency, more cores; more complex software; bigger data; mining bigger data with Machine Learning; bigger problems; better user experience.]

What bigger problems can we solve with bigger data?
How does solving bigger problems lead to a better user experience?

MACHINE LEARNING: TREND PREDICTION WITH POWERFUL MODELS
• Powerful models (with many knobs) tend to overfit the noise if the data set is not sufficiently large
• The explosive growth of data has made powerful models feasible
• A model with 1 billion knobs, trained with 10 million images from YouTube, was used in the Google Brain experiment to figure out the concepts of cats and human faces by itself

[Figure: data plotted against samples, with linear, 2nd-order polynomial and 6th-order polynomial fits. The 6th-order polynomial undulates excessively with only 4 samples.]

Source: Le et al., Building High-level Features Using Large Scale Unsupervised Learning
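The overfitting point can be reproduced with 4 noisy samples. This is a minimal sketch with made-up data, not the data behind the slide's chart: the underlying trend y = 50x and the noise values are hypothetical. It also swaps the slide's 6th-order polynomial for a cubic that passes exactly through all 4 samples, which is the smallest model with enough knobs to fit the noise perfectly.

```python
def fit_line(xs, ys):
    """Least-squares straight line (2 knobs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every sample (as many knobs as samples)."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


# Hypothetical samples: trend y = 50x plus fixed "noise".
xs = [0.0, 1.0, 2.0, 3.0]
noise = [10.0, -15.0, 12.0, -8.0]
ys = [50.0 * x + e for x, e in zip(xs, noise)]

line = fit_line(xs, ys)
cubic = fit_interpolant(xs, ys)

x_new = 5.0                      # a point the models have not seen
truth = 50.0 * x_new
line_err = abs(line(x_new) - truth)
cubic_err = abs(cubic(x_new) - truth)
cubic_train_err = max(abs(cubic(x) - y) for x, y in zip(xs, ys))

print("cubic error on its own samples:", cubic_train_err)
print("line  error at x=5:", line_err)
print("cubic error at x=5:", cubic_err)
```

The cubic reproduces its training samples exactly, yet it predicts far worse than the line at the unseen point, because it has fitted the noise rather than the trend.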

HOW TO DISTINGUISH CATS FROM DOGS?
ASIRRA
Animal Species Image Recognition for Restricting Access (from Microsoft Research)


CAN ASIRRA BE CRACKED?


WHY IS IT HARD?

Source: training set of Kaggle.com Dogs vs. Cats competition

IS THERE A MODEL FINDING OUT THAT THESE ARE THE SAME DOG?

Prancer, a 5-year-old toy poodle, before and after grooming

MINE THE SOLUTIONS FROM THE DATA
Dog-cat classifier
• Theory of the differences between dogs and cats?
• Machine Learning: learn from many (12,500) photos labeled as dogs or cats


SMART AND SMARTER CLIENTS IN THE ERA OF BIG DATA

Cloud
• Big Data → Bigger Data
• Big Training Set → Bigger Training Set
• Machine Learning → Bigger Machine Learning
• Powerful Model → Bigger Model (in the cloud or the clients)

Client → Smarter Client
• Sensing → Better Sensing
• Connectivity → Better Connectivity
• Answer → Better Answer
• Input data
• Local Machine Learning

PARALLEL COMPUTING IN THE CLOUD AND AT THE CLIENTS
Model: $y = f(x; \{a_i\})$
• Input $x$, for example dog/cat photos or sensor readings
• Output $y$, for example dog or cat; jogging, walking or driving
• Knobs $\{a_i\}$

Machine Learning
• Samples $(x_n, y_n)$
• Tweak $\{a_i\}$ to minimize the error between $f(x_n; \{a_i\})$ and $y_n$

Cloud: parallel computing with more samples
Client: parallel computing with more knobs
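Tweaking the knobs to minimize the error between the model output f(x_n; {a_i}) and the label y_n can be sketched with plain gradient descent. Everything specific here is an assumption for illustration: the two-knob linear model, the squared-error measure, the learning rate, the step count and the toy samples are not from the slides. The sums over samples are the part that could be spread across many cores when there are more samples.

```python
def f(x, a):
    """Hypothetical model with two knobs: f(x; a) = a[0] + a[1] * x."""
    return a[0] + a[1] * x


def mean_squared_error(samples, a):
    return sum((f(x, a) - y) ** 2 for x, y in samples) / len(samples)


def train(samples, steps=2000, lr=0.05):
    """Tweak the knobs by gradient descent on the mean squared error."""
    a = [0.0, 0.0]
    n = len(samples)
    for _ in range(steps):
        # Each sum runs over all samples and is independent per sample.
        g0 = sum(2.0 * (f(x, a) - y) for x, y in samples) / n
        g1 = sum(2.0 * (f(x, a) - y) * x for x, y in samples) / n
        a = [a[0] - lr * g0, a[1] - lr * g1]
    return a


# Toy samples (x_n, y_n) generated from y = 1 + 2x.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

knobs = train(samples)
print("knobs:", knobs)
print("error:", mean_squared_error(samples, knobs))
```

A model with more knobs would need one such gradient term per knob, which is the other axis of parallelism named on the slide.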

WHY HSA?
Machine learning happens in the cloud and at the clients
Models run in the cloud or at the clients
Need the same ease of programming and write-once-run-everywhere for heterogeneous cores

MediaTek is one of the cofounders of the HSA Foundation
MediaTek is the first to introduce in a mobile SoC:
• True Octa-Core
• Heterogeneous Multiprocessing (HMP)

SCALE OUT AND SCALE IN WITH HETEROGENEOUS CORES
• Both the cloud and mobile clients are limited by power
• Mobile devices need to keep cool in our palms
• Data centers need to keep our environment clean
• Carbon footprint of US datacenters is at the same level as the airline industry
• A 1,000 m² datacenter consumes 1.5 MW, enough to power 1,000 US homes per year

In order to scale out, we need to scale in with heterogeneous cores in the cloud and in our palms

[Figure: a typical 1,000 homes in the US]
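The datacenter comparison implies a per-home figure that is easy to work out. The sketch below assumes the 1.5 MW is a constant draw shared evenly by the 1,000 homes and that a year has 8,760 hours; the variable names are illustrative only.

```python
DATACENTER_POWER_W = 1.5e6      # 1.5 MW, from the slide
HOMES = 1000                    # from the slide
HOURS_PER_YEAR = 24 * 365       # assumption: non-leap year

# Average power available to each home, in kilowatts.
power_per_home_kw = DATACENTER_POWER_W / HOMES / 1000.0

# Energy each home would use over a year at that constant draw.
energy_per_home_kwh = power_per_home_kw * HOURS_PER_YEAR

print(f"{power_per_home_kw} kW per home")
print(f"{energy_per_home_kwh} kWh per home per year")
```

Under these assumptions the slide's figure corresponds to 1.5 kW of continuous draw per home.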

THE NEW VIRTUOUS CYCLE
PERHAPS, LEADING TO COMPUTING LIKE OUR BRAIN

Moore’s law and
beyond

Better user
experience

More heterogeneous
cores

Mining bigger data
with Machine Learning

Bigger data

MASSIVELY PARALLEL WORKLOADS
• Can keep growing the problem size N

• The serial workload s can be kept constant
• The parallel workload is sped up by P, the number of cores
• The reduction overhead is proportional to log P (by a factor of r)

• "Embarrassingly" parallel, when there is no reduction overhead (r=0)

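[Figure: execution time on one core (s + N) compared with execution time on P cores (s + r log P + N/P); the difference is the time saved by P cores.]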


THE ELEPHANTS: CPU CORES
FOR MULTIPLE-INSTRUCTION-MULTIPLE-DATA (MIMD) EXECUTION

Retrofitted for moderately parallel workloads, and not very efficient for massively parallel workloads

Parallel.For (i)
  …
  If (…)
    …
  Else
    …

A CPU core runs 1 iteration of the parallel loop

[Figure: CPU cores, each with its own front end and ALU. The same color means the same piece of code.]
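The MIMD picture, where each core has its own front end and runs one iteration of the parallel loop, can be sketched with a thread pool. This is an illustration of the structure only: `body` and `parallel_for` are hypothetical names, and in CPython the global interpreter lock means these threads show the programming model rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def body(i: int) -> int:
    """One loop iteration; each worker follows its own side of the branch."""
    if i % 2 == 0:
        return i * i      # "If" path
    else:
        return -i         # "Else" path


def parallel_for(n: int, workers: int = 4) -> list:
    """Run body(0..n-1) on a pool of independent workers, keeping order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(body, range(n)))


results = parallel_for(8)
print(results)
```

Because every worker fetches and decodes its own instruction stream, neighbouring iterations can take different branches at no extra cost, at the price of one front end per core.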

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names
are for informational purposes only and may be trademarks of their respective owners.

