HOW MANY CORES WILL WE NEED?
IN SEARCH OF PARALLEL KILLER APPS
CHIEN-PING LU, PHD
MEDIATEK INC
A GROUP OF HIPPOS IS CALLED …

A Crash
2 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL
A GROUP OF CROWS IS CALLED …

A Murder
A GROUP OF GIRAFFES IS CALLED …

From Wikipedia

A Tower
SO, IT IS NOT SURPRISING THAT WE USE

“A Parade” of elephants

“A Herd” of sheep

“An Army” of ants
FROM FREQUENCY TO MULTICORE SCALING

[Chart: performance, frequency and power over time; single-core scaling up to the power wall (2005), multi-core scaling after it]
IT SEEMS INEVITABLE THAT WE WILL NEED A MASSIVE NUMBER OF CORES

[Chart: performance over time, comparing a moderate and a massive number of cores]
DARK SILICON (OR DARK CORES)?

[Chart: performance over time; labels 2x, 4x → 3x, 8x → 4x, 16x → 4x]
HOW TO LIGHT UP THE CORES?
Redefine the cores to be heterogeneous

Search for parallel killer apps

[Chart: power against degree of parallelism, bounded by a power ceiling and a parallelism wall; regions for big cores, little cores and SIMT "cores"; example workloads: H.264 encoding, ray tracing]
ARMY OF ANTS: SIMT CORES
FOR SIMT (SINGLE-INSTRUCTION-MULTIPLE-THREAD ) EXECUTION

SIMT is the execution model of HSA and
implemented in modern GPUs, with
MIMD flexibility and SIMD efficiency

A SIMT core runs 1 iteration of the parallel loop:

Parallel.For (…)
  …
  If (…) then
    …
  Else
    …

[Diagram: several front ends, each driving a cluster of ALUs, alongside SPEs; a "Wider SIMT" variant with many more ALUs]

SPE: Specialized Processing Engines
A cluster of SIMT cores shares one front end in a SIMD manner
A branch is emulated through divergence
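
To make "a branch is emulated through divergence" concrete, here is a minimal Python sketch of one shared front end driving several lanes in lockstep. It is an illustration only, not how any particular GPU is implemented: the function name `simt_if_else` and the idea of counting front-end issues are assumptions made for this example.

```python
def simt_if_else(values, cond, then_op, else_op):
    """Run `if cond(v): then_op(v) else: else_op(v)` over all lanes in lockstep.

    Hypothetical model: one front end issues each branch path once for the
    whole cluster, and a per-lane mask decides which lanes commit a result.
    Returns (per-lane results, number of paths the front end had to issue).
    """
    mask = [cond(v) for v in values]      # per-lane predicate
    out = list(values)
    issued = 0

    # Pass 1: issue the "then" path; only lanes with the mask set commit.
    if any(mask):
        issued += 1
        for lane, v in enumerate(values):
            if mask[lane]:
                out[lane] = then_op(v)

    # Pass 2: issue the "else" path; only lanes with the mask clear commit.
    if not all(mask):
        issued += 1
        for lane, v in enumerate(values):
            if not mask[lane]:
                out[lane] = else_op(v)

    return out, issued


if __name__ == "__main__":
    is_even = lambda v: v % 2 == 0
    # Divergent cluster: both paths are issued.
    print(simt_if_else([1, 2, 3, 4], is_even, lambda v: v * 10, lambda v: -v))
    # ([-1, 20, -3, 40], 2)
    # Coherent cluster: only one path is issued.
    print(simt_if_else([2, 4], is_even, lambda v: v * 10, lambda v: -v))
    # ([20, 40], 1)
```

When every lane takes the same side of the branch, the cluster pays for one path; when lanes disagree, it pays for both, which is the cost of sharing a front end.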
MASSIVELY PARALLEL WORKLOADS
• Problem size N can keep growing

• Visible serial workload s can be kept constant
• Parallel workload is speeded up by P, the number of cores
• Reduction overhead is proportional to log P (by a factor of r)

• "Embarrassingly" parallel, when there is no reduction overhead (r=0)

[Diagram: run time on one core is s + N; on P cores it is s + r log P + N/P; the difference is the time saved by P cores]
REVISITING AMDAHL'S LAW

s = 50%, r = 50%

$$\text{Speedup} = \frac{s + N}{s + r \log P + N/P}$$

[Two log-scale plots of Speedup against Degree of Parallelism (P): curves for fixed N = 16, 64 and 256, and for P = N with P up to 8192]
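
The workload model from the two slides above can be explored numerically. The sketch below assumes the run time on P cores is s + r log P + N/P, as the bullets describe, and takes the logarithm to be base 2, which the slides do not state. The function names are hypothetical.

```python
import math


def parallel_time(n, p, s=0.5, r=0.5):
    """Run time on p cores: serial part + reduction overhead + parallel part.

    Assumption: the reduction overhead uses a base-2 logarithm.
    """
    return s + r * math.log2(p) + n / p


def speedup(n, p, s=0.5, r=0.5):
    """Speedup of p cores over one core for problem size n."""
    return parallel_time(n, 1, s, r) / parallel_time(n, p, s, r)


if __name__ == "__main__":
    cores = [2 ** k for k in range(0, 14)]  # 1 .. 8192

    # Fixed problem size: speedup peaks and then falls as log P dominates.
    for n in (16, 64, 256):
        row = ", ".join(f"{speedup(n, p):.1f}" for p in cores)
        print(f"N={n:>3}: {row}")

    # Problem size grows with the core count (P = N): speedup keeps rising.
    row = ", ".join(f"{speedup(p, p):.1f}" for p in cores)
    print(f"P=N  : {row}")
```

Under these assumptions a fixed problem size stops benefiting from more cores, while a problem that grows with the core count continues to scale, which is the argument for looking for ever-larger parallel workloads.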
GRAPHICS KEEP MOVING
Pac-Man, 1980
• Highest-grossing video game of all time
• Recognized by 94% of American consumers

Mobile 3D Graphics
• GL benchmark 2.1 Egypt
• GL benchmark 2.5 Egypt
• GFX bench 2.7 T-Rex
• GFX bench 3.0 Manhattan
MEDIATEK FACE BEAUTIFICATION
WHEN IT COMES TO BEAUTY, THERE SEEMS TO BE NO LIMIT

Before

Skin tone adjustment
Wrinkle removal

Thinner face, bigger eyes
HIGH-PERFORMANCE COMPUTING (HPC) KEEPS SCALING OUT

More atoms
Higher grid resolution
More time steps

• HPC from 1993 to 2012
‒ GFLOPS ~ 130,000x
‒ Cores ~ 11,000x
‒ GHz ~ 10x

[Chart: Top of Top500, 1993-2012; GFLOPS, cores and GHz relative to 1993 on a log scale; x-axis 1990 to 2015]
THE MISSING LINKS
IN SEARCH OF PARALLEL KILLER APPS

[Diagram labels: Moore’s law; higher frequency; more cores; more complex software; better user experience; bigger data; bigger problems; mining bigger data with Machine Learning]

What bigger problems to solve with bigger data?
How does solving bigger problems lead to better user experience?
MACHINE LEARNING: TREND PREDICTION WITH POWERFUL MODELS
• Powerful models (with many knobs) tend to overfit the noise if the data set is not sufficiently large
• The explosive growth of data has made powerful models feasible
• A model with 1 billion knobs, trained with 10 million images from YouTube, was used in the Google Brain experiment to figure out the concepts of cats and human faces by itself

Source: Le et al., Building High-level Features Using Large Scale Unsupervised Learning

[Plot: data samples with linear, 2nd-order and 6th-order polynomial fits; x-axis "Samples" from 0 to 6, y-axis from -50 to 350. The 6th-order polynomial undulates excessively with only 4 samples]
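
The overfitting point can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the data behind the slide's plot: the four samples are made up (a trend of y = 50x with noise of ±10), and a cubic that passes exactly through all four samples stands in for the slide's 6th-order polynomial, since four samples do not determine a 6th-order fit uniquely.

```python
def fit_line(xs, ys):
    """Least-squares straight line: a model with two knobs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every sample: as many knobs as samples."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


if __name__ == "__main__":
    # Hypothetical noisy samples of the trend y = 50 * x.
    xs = [1, 2, 3, 4]
    ys = [60, 90, 160, 190]
    line = fit_line(xs, ys)
    cubic = fit_interpolant(xs, ys)
    for x in (5, 6):
        print(f"x={x}: trend={50 * x}, line={line(x):.1f}, cubic={cubic(x):.1f}")
```

The cubic has zero error on the four samples yet swings far from the trend outside them, while the two-knob line stays close. With many more samples, the more powerful model would have less freedom to chase the noise.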
HOW TO DISTINGUISH CATS FROM DOGS?
ASIRRA
Animal Species Image Recognition for Restricting Access (from Microsoft Research)

CAN ASIRRA BE CRACKED?

WHY IS IT HARD?

Source: training set of Kaggle.com Dogs vs. Cats competition
IS THERE A MODEL FINDING OUT THAT THESE ARE THE SAME DOG?

Prancer, a 5-year-old toy poodle, before and after grooming
MINE THE SOLUTIONS FROM THE DATA
Dog-Cat
classifier

Theory of the differences
between dogs and cats?

Learn from many (12,500)
photos labeled as dogs or
cats
Machine Learning

SMART AND SMARTER CLIENTS IN THE ERA OF BIG DATA

[Diagram, shown in two builds]
Cloud: Big Data, Big Training Set, Machine Learning, Powerful Model (in the cloud or the clients)
Client: Input data, Sensing, Connectivity, Local Machine Learning, Answer
Second build: Bigger Data, Bigger Training Set, Bigger Machine Learning, Bigger Model, Better Sensing, Better Connectivity, Better Answer, Smarter Client
PARALLEL COMPUTING IN THE CLOUD AND AT THE CLIENTS
Examples:
• x: dog/cat photos; y: dog or cat
• x: sensor readings; y: jogging, walking or driving

Model: $y = f(x; \{a_i\})$
Samples: $(x_n, y_n)$
Knobs: $\{a_i\}$

Machine Learning: tweak $\{a_i\}$ to minimize the error between the model output $f(x_n; \{a_i\})$ and $y_n$

Cloud parallel computing with more samples
Client parallel computing with more knobs
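
As an illustration of "tweaking the knobs", the sketch below trains a two-knob linear model by plain gradient descent on the mean squared error. The choice of model, the error measure, the learning rate and all function names are assumptions made for this example; the slide does not specify them.

```python
def model(x, a):
    """f(x; a): a hypothetical two-knob model, a[0] * x + a[1]."""
    return a[0] * x + a[1]


def mean_squared_error(samples, a):
    """Average squared error between the model output and y_n."""
    return sum((model(x, a) - y) ** 2 for x, y in samples) / len(samples)


def train(samples, steps=2000, learning_rate=0.05):
    """Tweak the knobs by gradient descent on the mean squared error."""
    a = [0.0, 0.0]
    n = len(samples)
    for _ in range(steps):
        # Each sample contributes an independent term to the gradient,
        # so these sums are where more samples map onto more cores.
        grad0 = sum(2 * (model(x, a) - y) * x for x, y in samples) / n
        grad1 = sum(2 * (model(x, a) - y) for x, y in samples) / n
        a = [a[0] - learning_rate * grad0, a[1] - learning_rate * grad1]
    return a


if __name__ == "__main__":
    # Hypothetical samples (x_n, y_n) drawn from y = 2x + 1.
    samples = [(x, 2 * x + 1) for x in range(5)]
    knobs = train(samples)
    print(knobs, mean_squared_error(samples, knobs))
```

The per-sample sums parallelize across the training set, which matches the cloud side of the slide. Evaluating a trained model with many knobs is the part that would run at the client.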
WHY HSA?
Machine learning happens in the cloud and at the clients
Models run in the cloud or at the clients
Need the same ease of programming and write-once-run-everywhere for heterogeneous cores

MediaTek is one of the cofounders of the HSA Foundation
MediaTek is the first to introduce in mobile SoCs:
• True Octa-Core
• Heterogeneous Multiprocessing (HMP)
SCALE OUT AND SCALE IN WITH HETEROGENEOUS CORES
• Both the cloud and mobile clients are limited by power
• Mobile devices need to keep cool in our palms
• Data centers need to keep our environment clean
• The carbon footprint of US datacenters is at the same level as that of the airline industry
• A 1,000 m² datacenter consumes 1.5 MW, enough to power 1,000 US homes

In order to scale out, we need to scale in with heterogeneous cores in the cloud and in our palms

[Image: 1,000 typical US homes]
BACKUP
THE NEW VIRTUOUS CYCLE
PERHAPS, LEADING TO COMPUTING LIKE OUR BRAIN

Moore’s law and
beyond

Better user
experience

More heterogeneous
cores

Mining bigger data
with Machine Learning
Bigger data
MASSIVELY PARALLEL WORKLOADS
• Can keep growing the problem size N

• The serial workload s can be kept constant
• The parallel workload is speeded up by P, the number of cores
• The reduction overhead is proportional to log P (by a factor of r)

• "Embarrassingly" parallel, when there is no reduction overhead (r=0)

[Diagram: run time on one core is s + N; on P cores it is s + r log P + N/P; the difference is the time saved by P cores]
THE ELEPHANTS: CPU CORES
FOR MULTIPLE-INSTRUCTION-MULTIPLE-DATA (MIMD) EXECUTION

Retrofitted for moderately parallel
workloads, and not very efficient for
massively parallel workloads

A CPU core runs 1 iteration of the parallel loop:

Parallel.For (i)
  …
  If (…)
    …
  Else
    …

[Diagram: each ALU is paired with its own front end; the same color means the same piece of code]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names
are for informational purposes only and may be trademarks of their respective owners.


Keynote (Dr Chien-Ping Lu) - How Many Cores Will We Need? - by Dr Chien-Ping Lu, Sr Director – Corporate Technology Office, MediaTek USA Inc.

  • 1. HOW MANY CORES WILL WE NEED? IN SEARCH OF PARALLEL KILLER APPS CHIEN-PING LU, PHD MEDIATEK INC
  • 2. A GROUP OF HIPPOS IS CALLED … A Crash 2 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL
  • 3. A GROUP OF CROWS IS CALLED … A Murder 3 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL
  • 4. A GROUP OF GIRAFFES IS CALLED … From Wikipedia A Tower 4 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL
  • 5. SO, IT IS NOT SURPRISING THAT WE USE “A Parade” of elephants 5 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL “A Herd” of sheep “An Army” of ants
  • 6. FROM FREQUENCY TO MULTICORE SCALING Power Frequency performance Power Single-core Time 6 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL Multi-core Power wall: 2005
  • 7. IT SEEMS INEVITABLE THAT WE WILL NEED A MASSIVE NUMBER OF CORES performance Moderate Time 7 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL Massive
  • 8. DARK SILICON (OR DARK CORES)? performance 8x  4x 4x  3x 2x Time 8 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL 16x  4x
  • 9. HOW TO LIGHT UP THE CORES? Redefine the cores to be heterogeneous Search for parallel killer apps power Power ceiling SIMT “cores” Little cores H.264 encoding Big cores Parallelism wall Degree of Parallelism 9 | HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL Ray tracing
  • 10. ARMY OF ANTS: SIMT CORES FOR SIMT (SINGLE-INSTRUCTION-MULTIPLE-THREAD) EXECUTION. SIMT is the execution model of HSA and is implemented in modern GPUs, with MIMD flexibility and SIMD efficiency. A SIMT core runs 1 iteration of the parallel loop. A cluster of SIMT cores shares one front end in a SIMD manner. A branch is emulated through divergence. [Figure: Parallel.For (…) with If (…) then … Else …; labels: Front End, ALU, SPE (Specialized Processing Engines), Wider SIMT]
  • 11. MASSIVELY PARALLEL WORKLOADS. Problem size N can keep growing. Visible serial workload s can be kept constant. Parallel workload is sped up by P, the number of cores. Reduction overhead is proportional to log P (by a factor of r). "Embarrassingly" parallel when there is no reduction overhead (r=0). [Figure: one bar with segments s and N; one bar with segments s, N/P and r log P; label: Time saved by P cores]
  • 12. REVISITING AMDAHL'S LAW. $\text{Speedup} = \dfrac{s + N}{s + N/P + r \log P}$, plotted for s = 50%, r = 50%. [Figure: speedup vs. degree of parallelism (P), logarithmic axes; curves labeled N=16, N=64, N=256 and P=N]
  • 13. GRAPHICS KEEP MOVING. Pac-Man, 1980: highest-grossing video game of all time; recognized by 94% of American consumers. Mobile 3D Graphics: GLBenchmark 2.1 Egypt, GLBenchmark 2.5 Egypt, GFXBench 2.7 T-Rex, GFXBench 3.0 Manhattan.
  • 14. MEDIATEK FACE BEAUTIFICATION: WHEN IT COMES TO BEAUTY, THERE SEEMS TO BE NO LIMIT. [Images: Before; Skin tone adjustment; Wrinkle removal; Thinner face, bigger eyes]
  • 15. HIGH-PERFORMANCE COMPUTING (HPC) KEEPS SCALING OUT: more atoms, higher grid resolution, more time steps. HPC from 1993 to 2012: GFLOPS ~130,000x; cores ~11,000x; GHz ~10x. [Figure: Top of Top500, 1993-2012; GFLOPS, cores and GHz relative to 1993, by year]
  • 16. THE MISSING LINKS IN SEARCH OF PARALLEL KILLER APPS. What bigger problems to solve with bigger data? How does solving bigger problems lead to better user experience? [Diagram labels: Moore’s law; Higher frequency; More complex software; Better user experience; More cores; Bigger data; Bigger problems; Mining bigger data with Machine Learning]
  • 17. MACHINE LEARNING: TREND PREDICTION WITH POWERFUL MODELS. Powerful models (with many knobs) tend to overfit the noise if the data set is not sufficiently large. The explosive growth of data has made powerful models feasible. A model with 1 billion knobs, trained with 10 million images from YouTube, was used in the Google Brain experiment to figure out the concepts of cats and human faces by itself. Source: Le et al., Building High-level Features Using Large Scale Unsupervised Learning. [Figure: data samples with linear, 2nd-order polynomial and 6th-order polynomial fits; the 6th-order polynomial undulates excessively with only 4 samples]
  • 18. HOW TO DISTINGUISH CATS FROM DOGS? ASIRRA: Animal Species Image Recognition for Restricting Access (from Microsoft Research)
  • 19. CAN ASIRRA BE CRACKED?
  • 20. WHY IS IT HARD? Source: training set of Kaggle.com Dogs vs. Cats competition
  • 21. IS THERE A MODEL THAT CAN FIND OUT THAT THESE ARE THE SAME DOG? Prancer, a 5-year-old toy poodle, before and after grooming
  • 22. MINE THE SOLUTIONS FROM THE DATA. [Diagram: Dog-Cat classifier; Theory of the differences between dogs and cats?; Machine Learning: learn from many (12,500) photos labeled as dogs or cats]
  • 23. SMART AND SMARTER CLIENTS IN THE ERA OF BIG DATA. [Diagram labels, baseline / overlay: Big Data / Bigger Data; Training Set / Bigger Training Set; Machine Learning / Bigger Machine Learning; Powerful / Bigger Model (in the cloud or the clients); Client / Smarter Client; Sensing / Better Sensing; Connectivity / Better Connectivity; Answer / Better Answer; Cloud; Input data; Local Machine Learning]
  • 24. PARALLEL COMPUTING IN THE CLOUD AND AT THE CLIENTS. Model: $y = f(x; \{a_i\})$, with knobs $\{a_i\}$. Examples: x = dog/cat photos, y = dog or cat; x = sensor readings, y = jogging, walking or driving. Machine Learning: given samples $(x_n, y_n)$, tweak $\{a_i\}$ to minimize the error between $f(x_n; \{a_i\})$ and $y_n$. Cloud: parallel computing with more samples. Client: parallel computing with more knobs.
  • 25. WHY HSA? Machine learning happens in the cloud and at the clients. Models run in the cloud or at the clients. Both need the same ease of programming and write-once-run-everywhere for heterogeneous cores. MediaTek is one of the cofounders of the HSA Foundation. MediaTek is the first to introduce, in a mobile SoC: True Octa-Core; Heterogeneous Multiprocessing (HMP).
  • 26. SCALE OUT AND SCALE IN WITH HETEROGENEOUS CORES. Both the cloud and mobile clients are limited by power. Mobile devices need to keep cool in our palms. Data centers need to keep our environment clean. The carbon footprint of US datacenters is at the same level as the airline industry. A 1,000 m² datacenter consumes 1.5 MW, enough to power 1,000 typical US homes. In order to scale out, we need to scale in with heterogeneous cores in the cloud and in our palms.
  • 28. THE NEW VIRTUOUS CYCLE: PERHAPS LEADING TO COMPUTING LIKE OUR BRAIN. [Diagram labels: Moore’s law and beyond; More heterogeneous cores; Bigger data; Mining bigger data with Machine Learning; Better user experience]
  • 29. MASSIVELY PARALLEL WORKLOADS. The problem size N can keep growing. The serial workload s can be kept constant. The parallel workload is sped up by P, the number of cores. The reduction overhead is proportional to log P (by a factor of r). "Embarrassingly" parallel when there is no reduction overhead (r=0). [Figure: one bar with segments s and N; one bar with segments s, N/P and r log P; label: Time saved by P cores]
  • 30. THE ELEPHANTS: CPU CORES FOR MULTIPLE-INSTRUCTION-MULTIPLE-DATA (MIMD) EXECUTION. Retrofitted for moderately parallel workloads, and not very efficient for massively parallel workloads. A CPU core runs 1 iteration of the parallel loop. [Figure: Parallel.For (i) with If (…) … Else …; Front End and ALU blocks in equal number; the same color means the same piece of code]
  • 31. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
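
Slide 10 states that a branch is emulated through divergence when a cluster of SIMT cores shares one front end. One way to picture this is masked execution: the lanes step through both sides of the branch, and a per-lane mask decides which lanes commit a result on each pass. The following is a simplified sketch under that assumption, not a description of any particular GPU; the function name `simt_if_else` and the example branch bodies are hypothetical.

```python
def simt_if_else(values, condition, then_op, else_op):
    """Emulate If/Else across lanes that share one instruction stream.

    Each branch body is issued at most once for the whole group of lanes;
    a mask selects which lanes commit a result on each pass. This is a
    simplification for illustration only.
    """
    mask = [condition(v) for v in values]
    results = list(values)
    passes = 0

    # Pass 1: the "then" side, committed only by lanes whose mask is True.
    if any(mask):
        passes += 1
        for lane, v in enumerate(values):
            if mask[lane]:
                results[lane] = then_op(v)

    # Pass 2: the "else" side, committed only by the remaining lanes.
    if not all(mask):
        passes += 1
        for lane, v in enumerate(values):
            if not mask[lane]:
                results[lane] = else_op(v)

    return results, passes


if __name__ == "__main__":
    lanes = [1, 2, 3, 4]
    out, passes = simt_if_else(lanes, lambda v: v % 2 == 0,
                               lambda v: v * 10, lambda v: -v)
    print(out, passes)
```

In this sketch, lanes that all agree on the condition need a single pass, while lanes that disagree cause both sides to be issued one after the other, which is the cost of divergence.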
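
The workload model on slides 11 and 12 (serial part s, parallel part N/P, reduction overhead r log P) can be evaluated in a few lines of Python. This is a minimal sketch rather than the author's code: it assumes time is measured in abstract units of work and that the logarithm is base 2, which the slides do not state, and the function names `parallel_time` and `speedup` are illustrative.

```python
import math


def parallel_time(n, s, r, p):
    """Time on p cores: serial part s, parallel part n/p, reduction r*log2(p).

    The base-2 logarithm is an assumption; the slides only say "log P".
    """
    return s + n / p + r * math.log2(p)


def speedup(n, s, r, p):
    """Speedup over one core, where the single-core time is s + n."""
    return (s + n) / parallel_time(n, s, r, p)


if __name__ == "__main__":
    s, r = 0.5, 0.5  # illustrative values, echoing "s = 50%, r = 50%"
    for n in (16, 64, 256):
        row = [round(speedup(n, s, r, p), 2) for p in (1, 4, 16, 64, 256)]
        print(f"N={n}: {row}")
```

With s = 0 and r = 0 the model reduces to the embarrassingly parallel case, where the speedup equals P. For a fixed P, a larger N gives a larger speedup, which matches the slide's point that the problem size can keep growing.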
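
Slide 24 describes machine learning as tweaking the knobs of a model to minimize the error between the model's output and the labeled samples. The smallest possible instance of that loop is sketched below. It assumes a two-knob linear model, a squared-error measure, and plain gradient descent; none of these choices come from the slides, the sample data is synthetic, and the names `predict`, `mean_squared_error` and `fit` are illustrative.

```python
def predict(x, a):
    """Two-knob linear model f(x; a) = a[0] + a[1] * x (an assumed model)."""
    return a[0] + a[1] * x


def mean_squared_error(samples, a):
    """Average squared error between f(x_n; a) and y_n over all samples."""
    return sum((predict(x, a) - y) ** 2 for x, y in samples) / len(samples)


def fit(samples, steps=5000, learning_rate=0.01):
    """Tweak the knobs a by gradient descent to reduce the squared error."""
    a = [0.0, 0.0]
    n = len(samples)
    for _ in range(steps):
        # The per-sample terms are independent of each other, so these sums
        # are the part that could be spread across cores with many samples.
        grad0 = sum(2 * (predict(x, a) - y) for x, y in samples) / n
        grad1 = sum(2 * (predict(x, a) - y) * x for x, y in samples) / n
        a = [a[0] - learning_rate * grad0, a[1] - learning_rate * grad1]
    return a


if __name__ == "__main__":
    # Synthetic samples on the line y = 1 + 2x (made up for illustration).
    samples = [(x, 1.0 + 2.0 * x) for x in range(6)]
    knobs = fit(samples)
    print(knobs, mean_squared_error(samples, knobs))
```

The sum over samples corresponds to the slide's "parallel computing with more samples"; a model with many more knobs would add a second axis of parallel work, one gradient component per knob.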