V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran
University of Belgrade

Oskar Mencer
Imperial College, London

Oliver Pell

Maxeler Technologies, London and Palo Alto

Michael Flynn
Stanford University, Palo Alto, USA

Valentina Balas
Aurel Vlaicu University of Arad, Romania, Maxeler Ambassador

1/83
An Alternative Title:
How to hire more than 1000 PhD students
at no additional cost to taxpayers?
For Big Data algorithms
and for the same hardware price as before,
achieving:
a) speed-up, 20-200
b) monthly electricity bills, reduced 20 times
c) size, 20 times smaller
The major issues of engineering are: design cost and design complexity.
Remember, economy has its own rules: production count and market demand!
3/83
Elaboration :)
If a computer center spends E50M/year on electricity bills,
and moves most of its time-consuming algorithms to Maxeler,
which uses 20 times less power,
the yearly spending drops to E2.5M,
and E47.5M is saved for taxpayers :)
If the average net salary of a PhD student in Germany is E1500 per month,
and if the overhead factor is 1.00,
it is easy to calculate
that E47.5M can pay about 2638 PhD students to work for one year,
and that can go on year after year :)
If the overhead factor is 2.638
(I do not know how big it is, but it is less than 2.638, for sure),
one can hire 1000 PhD students, at no additional cost :)
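The arithmetic on this slide can be checked with a short sketch. It assumes that the E1500 figure is a monthly net salary (so one student-year costs 12 times that amount) and that the overhead factor simply multiplies the salary; the function names are illustrative only. Under these assumptions, direct division of E47.5M by E18,000 gives about 2638 student-years.

```python
def yearly_savings(bill_per_year: float, power_reduction: float) -> float:
    """Money saved per year when the power bill shrinks by the given factor."""
    return bill_per_year - bill_per_year / power_reduction


def students_funded(savings: float, monthly_net_salary: float,
                    overhead_factor: float = 1.0) -> int:
    """Whole number of student-years that the savings can pay for.

    Assumption: yearly cost = 12 * monthly salary * overhead factor.
    """
    yearly_cost = 12 * monthly_net_salary * overhead_factor
    return int(savings // yearly_cost)


savings = yearly_savings(50e6, 20)            # E50M bill, 20 times less power
print(savings)                                # saved amount per year
print(students_funded(savings, 1500))         # overhead factor 1.00
print(students_funded(savings, 1500, 2.638))  # overhead factor giving about 1000
```

Changing `overhead_factor` shows how quickly the head count falls as institutional overhead grows.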
1. Over 95% of run time in loops [loops to almost zero]
2. Reusability of data (e.g., x+x^2+x^3+x^4+…) [how close to zero?]
3. BigData [prog: for data streaming, not for data control]
4. Latency
5. A new programming model
6. WORM [prog.effort+comp.tim]
Use a tractor, not a Ferrari, to drive over a plowed field
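One way to read the data reusability item is that each term of x+x^2+x^3+x^4+… can be built from the previous term with a single multiplication. The sketch below is only an illustration of that reading; the function name is hypothetical.

```python
def power_series_sum(x: float, n_terms: int) -> float:
    """Sum x + x^2 + ... + x^n_terms, reusing the previous power each step."""
    total = 0.0
    power = 1.0
    for _ in range(n_terms):
        power *= x       # x^k obtained from x^(k-1) with one multiply
        total += power
    return total


print(power_series_sum(2.0, 4))  # 2 + 4 + 8 + 16
```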

5/83
Absolutely all results achieved in Europe:
a) All hardware produced in Europe,
specifically UK
b) All software generated by programmers
of EU and WB

6/83
ControlFlow (MultiFlow and ManyFlow):
• Top500 ranks using Linpack (Japanese K, IBM Sequoia, Cray Titan, …)

DataFlow:
• Coarse Grain (HEP) vs. Fine Grain (Maxeler)
The history starts in the 1960s!
The enabler technology did not exist before the year 2000!

7/83
Compiling below the machine code level brings speedups,
and also lower power, size, and cost.
The price to pay:
The machine is more difficult to program.
Consequently:
Ideal for WORM applications :)
Examples using Maxeler:
GeoPhysics (20-200), Banking (200-2000, with JP Morgan 20%),
M&C (New York City), Datamining (Google), …
8/83
Simulator builder
Hardware builder
9
2n+3
10/83
Why Java? Minimal Kolmogorov Complexity, etc…

11/83
12
13
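$$t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$$

$$t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$$

$$t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$$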

Assumptions:
1. Software includes enough parallelism to keep all cores busy
2. The only limiting factor is the number of cores.
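The three timing formulas can be evaluated side by side with a small sketch. The slide does not define its symbols, so the reading used here is an assumption: N is the number of data items, NOPS the operations per item, C the clock cycles per operation, Tclk the clock period in seconds, and Ncores or NDF the number of parallel cores or dataflow pipelines. The input values are arbitrary placeholders, not measurements.

```python
def t_control_flow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """tCPU / tGPU: every item pays the full per-item cost, split across cores."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def t_dataflow(n, n_ops, cycles_per_op, t_clk, n_df):
    """tDF: fill the pipeline once, then one result per clock per pipeline."""
    return n_ops * cycles_per_op * t_clk + (n - 1) * t_clk / n_df


# Arbitrary illustrative inputs (assumptions, not vendor figures).
n_items, ops_per_item = 1_000_000, 100
print(t_control_flow(n_items, ops_per_item, 1, 1 / 3e9, 8))    # CPU-like
print(t_control_flow(n_items, ops_per_item, 1, 1 / 1e9, 512))  # GPU-like
print(t_dataflow(n_items, ops_per_item, 1, 1 / 200e6, 1))      # dataflow-like
```

The dataflow expression is dominated by the (N - 1) term for large N, which is why its time no longer scales with NOPS once the pipeline is full.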

14/83
DualCore?

Which way are the horses going?

15/83
Is it possible to use 2000 chickens instead of two horses?

What is better, real and anecdotic?
16/83
2 x 1000 chickens (CUDA and rCUDA)
17/83
How about 2 000 000 ants?
18/83
Big Data Input

Results

Marmalade

19/83
Factor: 20 to 200
MultiCore/ManyCore

Dataflow

Machine Level Code

Gate Transfer Level
20/83
Factor: 20
MultiCore/ManyCore

Dataflow

21/83
Factor: 20
MultiCore/ManyCore

DataFlow

Data Processing

Data Processing
Process Control

Process Control

22/83
• MultiCore:
– Explain what to do, to the driver
– Caches, instruction buffers, and predictors needed

• ManyCore:
– Explain what to do, to many sub-drivers
– Reduced caches and instruction buffers needed

• DataFlow:
– Make a field of processing gates: 1C+2nJava+3Java
– No caches, etc. (300 students/year: BGD, BCN, LjU, ICL,…)
23/83
MultiCore:
• Business as usual

ManyCore:
• More difficult

DataFlow:
• Much more difficult
• Debugging both, application and configuration code

24/83
• MultiCore/ManyCore:
– Several minutes

• DataFlow:
– Several hours for the real hardware
– Fortunately, only several minutes for the simulator, several seconds for reload (90% due to DRAM inertia), and several milliseconds to restart
– The simulator supports both the large JPMorgan machine and the smallest “University Support” machine

• Good news:
– Tabula@2GHz
25/83
26/83
MultiCore:
• Horse stable

ManyCore:
• Chicken house

DataFlow:
• Ant hole

27/83
MultiCore:
• Haystack

ManyCore:
• Cornbits

DataFlow:
• Crumbs

28/83
Small Data: Toy Benchmarks (e.g., Linpack)
29/83
Medium Data (benchmarks favoring NVidia, compared to Intel,…)

30/83
Big Data

31/83
Maxeler Hardware
• CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London
32/83
Major Classes of Algorithms,
from the Computational Perspective
1. Coarse grained, stateful: Business
– CPU requires DFE for minutes or hours
– Interrupts

2. Fine grained, transactional with shared database: DM
– CPU utilizes DFE for ms to s
– Many short computations, accessing common database data

3. Fine grained, stateless transactional: Science (Phy, ...)
– CPU requires DFE for ms to s
– Many short computations

33/83
Coarse Grained: Modeling

34/83

[Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) versus peak frequency (Hz)]
[Figure: equivalent CPU cores versus number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]

• Long runtime, but:
• Memory requirements change dramatically based on modelled frequency
• Number of DFEs allocated to a CPU process can be easily varied to increase available memory
• Streaming compression
• Boundary data exchanged over chassis MaxRing
Fine Grained, Shared Data: Monitoring
• DFE DRAM contains the database to be searched
• CPUs issue transactions find(x, db)
• Complex search function
– Text search against documents
– Shortest distance to coordinate (multi-dimensional)
– Smith Waterman sequence alignment for genomes

• Any CPU runs on any DFE
that has been loaded with the database
– MaxelerOS may add or remove DFEs
from the processing group to balance system demands
– New DFEs must be loaded with the search DB before use
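The find(x, db) transaction can be pictured with a plain software sketch of one of the listed search functions, the shortest distance to a multi-dimensional coordinate. This is an assumed, CPU-only illustration of the semantics; it is not the DFE implementation, and the function signature is hypothetical.

```python
import math


def find(x, db):
    """Return (index, distance) of the database point closest to x."""
    best_index, best_dist = -1, math.inf
    for i, point in enumerate(db):
        d = math.dist(x, point)      # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


db = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)]
print(find((0.9, 1.2, 1.0), db))
```

Each call is independent of the others and reads only the shared database, which matches the slide's point that any CPU can run on any DFE that already holds the database.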
35/83
Fine Grained, Stateless: The BSOP Control
• Analyse > 1,000,000 scenarios
• Many CPU processes run on many DFEs
• ≈50x MPC-X vs. multi-core x86 node
• Each transaction executes on any DFE in the assigned group atomically

[Figure: CPUs send market and instruments data to the DFEs. Each DFE loops over instruments: random number generator and sampling of underliers, then price instruments using Black Scholes. Instrument values return to the CPUs for tail analysis on CPU.]
36/83
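The stateless scenario loop described on this slide can be sketched in software as follows. The closed-form Black-Scholes price of a European call is standard; everything else (the lognormal shock applied to the underlier, the parameter names, the quantile used for the tail) is an assumption made only for illustration and does not describe the production system.

```python
import math
import random


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (
        vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def scenario_values(spot, strike, rate, vol, maturity,
                    n_scenarios, shock_vol, seed=0):
    """Sample the underlier and reprice the instrument in every scenario."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        shocked = spot * math.exp(shock_vol * rng.gauss(0.0, 1.0))  # assumed shock
        values.append(black_scholes_call(shocked, strike, rate, vol, maturity))
    return values


def tail_value(values, quantile=0.01):
    """Instrument value at the given lower quantile (the CPU-side tail analysis)."""
    ordered = sorted(values)
    return ordered[int(quantile * (len(ordered) - 1))]


values = scenario_values(100.0, 100.0, 0.05, 0.2, 1.0, 1000, 0.1)
print(tail_value(values))
```

Every scenario is independent, so the inner loop is the part that would map to the DFEs, while sorting and quantile selection stay on the CPU.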
Selected Examples:
Business,
Mathematics,
GeoPhysics, etc.
37/83
38
An MIS Example: Credit Derivatives
Orbital station

Climber

Tether

HW
41
Seismic Imaging

• Running on MaxNode servers
- 8 parallel compute pipelines per chip
- 150MHz => low power consumption!
- 30x faster than microprocessors

An Implementation of the Acoustic Wave Equation on FPGAs
T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§
†Chevron, ‡Maxeler, §Formerly Chevron, SEG 2008
42/83
The CRS Results


Performance of one MAX2 card vs. 1 CPU core


Land case (8 params), speedup of 230x



Marine case (6 params), speedup of 190x
CPU Coherency

43/83

MAX2 Coherency
P. Marchetti et al, 2010

Trace Stacking: Speed-up 217
• DM for Monitoring and Control in Seismic processing
• Velocity independent / data driven method
to obtain a stack of traces, based on 8 parameters
• Search for every sample of each output trace
$$t_{hyp}^2 = \left( t_0 + \frac{2}{v_0} w^T m \right)^2 + \frac{2 t_0}{v_0} \left( m^T H_{zy} K_N H_{zy}^T m + h^T H_{zy} K_{NIP} H_{zy}^T h \right)$$

2 parameters (emergence angle & azimuth)
3 Normal Wave front parameters (KN,11; KN,12; KN,22)
3 NIP Wave front parameters (KNip,11; KNip,12; KNip,22)
47/83
Maxeler running Smith Waterman

48
Molecular Correlates of Tumor Signatures
from a Large Cohort
From whole slide sections, of a cohort,
to pathway analysis (Prof Bahram Parvin,
Berkeley)

High Content Analysis (HCA) on MPC-X
51
Conclusion: Nota Bene
This is about algorithmic changes,
to maximize
the algorithm to architecture match:
algorithmic modifications,
pipeline utilization,
data choreography,
and
decision making precision.
The winning paradigm of Big Data ExaScale?

52/83
Algorithmic Changes: Data Dependencies
[Figure: data dependency graphs with operators OP and OP' over PSI[0] … PSI[N-1] and cbeta[0] … cbeta[N-2]]
Example generated by Sasa Stojanovic (Gross-Pitaevskii)

53/83
Pipeline Changes: Higher Efficiency
[Figure: pipeline diagram with inputs X[0,0], X[0,1], … and result R[0,0]]

Example generated by Sasa Stojanovic (Gross-Pitaevskii)
54/83
Data Rechoreography: Pipeline Utilization
Example generated by Sasa Stojanovic (Gross-Pitaevskii)

[Figure: order of data accesses inside of a burst]

55/83
Fixed Point: Savings Reinvestable
• Consider fixed point
compared to single precision floating point
• If the range is tightly confined,
one could use 24-bit fixed point
• If data has a wider range, may need 32-bit fixed point
            hwFloat(8,24)   hwFix(24,...)   hwFix(32,...)
Add         500 LUTs        24 LUTs         32 LUTs
Multiply    2 DSPs          2 DSPs          4 DSPs


• Arithmetic is not 100% of the chip.
In practice, often ~5x performance boost from fixed point.
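The precision trade-off behind fixed point can be explored in software before committing to hardware. The sketch below quantizes a value to a signed fixed-point word with saturation. The split between integer and fractional bits is an assumption chosen for illustration (the slide leaves the hwFix parameters elided), and the helper names are hypothetical.

```python
def to_fixed(value: float, frac_bits: int, total_bits: int) -> int:
    """Quantize to a signed fixed-point integer, saturating on overflow."""
    raw = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, raw))


def from_fixed(raw: int, frac_bits: int) -> float:
    """Convert the fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


# Assumed format: 24-bit word with 16 fractional bits.
for v in (0.5, 0.1, 1000.0):
    q = to_fixed(v, 16, 24)
    print(v, q, from_fixed(q, 16))
```

A value such as 1000.0 saturates in this assumed format, which is the case where the slide suggests moving to a wider word.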
56
• Revisiting the Top 500 SuperComputers benchmarks
– Our paper in Communications of the ACM

• Revisiting all major Big Data DM algorithms
– Massive static parallelism at low clock frequencies

• Concurrency and communication
– Concurrency between millions of tiny cores difficult, “jitter” between cores will harm performance at synchronization points

• Reliability and fault tolerance
– 10-100x fewer nodes, failures much less often

• Memory bandwidth and FLOP/byte ratio
– Optimize data choreography, data movement, and the algorithmic computation

• New architecture of n-Programming paradigms
57/83
FP7: RoMoL@BCN

The SAB goal: Out of box thinking!
58/83
FP7: BalCon@SRB

The vision of Alkis Konstantellos

The SAB goal: Seed for new proposals!
59/83
DAFNE: Leader MISANU

60/83
DAFNE = South (MaxCode) + North (BigData)
South: MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, IRB, QPLAN, Bogazici, U of Istanbul, U of Bucharest, U of Arad, U of Tuzla, Technion, Maxeler Israel, IPSI
North: UK, Sweden, Norway, Denmark, Germany, France, Austria, Switzerland, Poland, Hungary
61/83
The DAFNE Map

62/83
The TriPeak @
DATAMAN

Siena
+ BSC
+ Imperial College
+ Maxeler
+ Belgrade

63/83
The TriPeak: Essence
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)
How about a happy marriage?
MontBlanc (ompSS) and Maxeler (an accelerator)
In each happy marriage,
it is known who does what :)
The Big Data DM algorithms:
What part goes to MontBlanc and what to Maxeler?
64/83
TriPeak: Core of the Symbiotic Success
An intelligent DM algorithmic scheduler,
partially implemented for compile time,
and partially for run time.
At compile time:
Checking what part of code fits where
(MontBlanc or Maxeler): LoC 1M vs 2K vs 20K
At run time:
Rechecking the compile time decision,
based on the current data values.
65/83
66/83
Maxeler: Research (Google: good method)

Structure of a Typical Research Paper: Scenario #1
[Comparison of Platforms for One Algorithm]
Curve A: MultiCore of approximately the same PurchasePrice
Curve B: ManyCore of approximately the same PurchasePrice
Curve C: Maxeler after a direct algorithm migration
Curve D: Maxeler after algorithmic improvements
Curve E: Maxeler after data choreography
Curve F: Maxeler after precision modifications

Structure of a Typical Research Paper: Scenario #2
[Ranking of Algorithms for One Application]
CurveSet A: Comparison of Algorithms on a MultiCore
CurveSet B: Comparison of Algorithms on a ManyCore
CurveSet C: Comparison on Maxeler, after a direct algorithm migration
CurveSet D: Comparison on Maxeler, after algorithmic improvements
CurveSet E: Comparison on Maxeler, after data choreography
CurveSet F: Comparison on Maxeler, after precision modifications

67/83
Maxeler Research in Serbia:
Special Issue of IPSI Transactions Journal
KG: Blood Flow, Tijana Djukic and Prof. Filipovic

NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic
MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic
ETF: Meteorology, Radomir Radojicic and Marko Stankovic
ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic
ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic
68/83
Maxeler Research WorldWide:
Special Issue of Advances in Computers @ SCI

Stanford, Texas,
Imperial, Maxeler,
ETF, MF, MISANU, IMP, KG, NS,
BSC, UPV,
U of Siena, U of Roma,
IJS, FRI, …

69/83
© H. Maurer

70/83
Maxeler: Teaching (Google: prof vm)
TEACHING, VLSI, PowerPoints, Maxeler:
Maxeler Veljko Explanations, August 2012
Maxeler Veljko Anegdotic,
Maxeler Oskar Talk, August 2012
Maxeler Forbes Article
Flyer by JP Morgan
Flyer by Maxeler HPC
Tutorial Slides by Sasha and Veljko: Practice (Current Update)
Paper, unconditionally accepted for Advances in Computers by Elsevier
Paper, unconditionally accepted for Communications of the ACM
Tutorial Slides by Oskar: Theory (7 parts)
Slides by Jacob, New York
Slides by Jacob, Alabama
Slides by Sasha: Practice (Current Update)
Maxeler in Meteorology
Maxeler in Mathematics
Examples generated in Belgrade and Worldwide
THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN,
with an example
71/83
Maxeler PreConference Tutorials (2013)
Google:
IEEE HiPEAC, Berlin, Germany, January 2013
ACM iSAC, Coimbra, Portugal, March 2013
IEEE MECO, Budva, Montenegro, June 2013
ACM ISCA, Tel Aviv, Israel, June 2013

72/83
Maxeler InHouse Tutorials (2013)

73/83
74/83
Maxeler University Program Members

75/83
How to Become a Family Member?
Options to consider:
a. MAX-UP free of charge
b. Purchasing a university-level machine
(min about $10K)
c. Purchasing a JPM-level machine
(slowly approaching $100M),
or at least a Schlumberger-level machine
(slowly moving above $10M)
76/83
Good to Know!

Maxeler employs close to 100 people, GBR and USA:
a. Maxeler cash burn per year = about $10M
b. If a university-level machine is sold at the 100% profit margin,
the company life of Maxeler is extended for about 2 hours.
c. If a university-level machine is sold at the 1% profit margin,
the company life of Maxeler is extended for 1 minute.
Our past or ongoing FP7 projects requiring Maxeler speeds:
a. ProSense
b. ARTreat
c. HiPEAC

77/83
The Educational Mission
Important note:
a. Total number of accredited universities in the whole world?
b. As per WeboMetrics, about 20000.
c. Consequently, all universities of the world together bring only:
20000 minutes of extra life, or about two weeks of extra life.
The reality:
a. University-level machines are sold at the ZERO profit margin!
b. Only the Xilinx costs, handling, and shipping.
c. Email support for students doing a thesis is practically unlimited!
Conclusion: This is a chance for those who jump in first :)
78/83
Our Work Impacting Maxeler
Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix
Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact
factor 2.205/2010).
Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple
Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor
2.205/2010).
Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network
Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS
ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227.
Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations
Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE
TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040
(impact factor 1.822/2010).
79/83
Maxeler Impacting Our Work
Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of
Interdisciplinary Education On Technology-driven Application Design IEEE
Transactions on Education, August 2011, pp.462-470. (impact factor
1.328/2010).
Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S.,
Milutinovic, V., Fast File Existence Checking in Archiving Systems, ACM
Transactions on Storage (TOS), Volume 7, Issue 1,
June 2011, ACM New York, NY, USA.
Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix
Multiplication, IET Computers & Digital Techniques, 2012, 6, (4), pp. 249-256.
Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec,
R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to
Petadata per Unit of Time and Power (On Sophisticated Benchmarks)
Communications of the ACM, May 2013 (impact factor 1.919/2010).
80/83
Current Main Efforts of Maxeler
1. To encourage a lot of software to be written/ported.
This is a key business opportunity that needs to be developed.
2. Maxeler is building up a website and a community
to share software for DFEs.
This would allow the software to also be sold
directly from the Maxeler website.
3. If a PhD student ports an important piece of software
to a Maxeler machine,
she/he could become the first software vendor in the world
for dataflow computers,
and Maxeler would be happy to help sell licenses.
81/83
Current Side Efforts of Maxeler
1. Developing new tools for easier making of kernels.
2. Bringing new languages to Maxeler:
C, C++, MATLAB, Mathematica
3. Porting popular application packages to Maxeler:
OpenSees, etc...
4. Trying the Tabula FPGA!
5. Getting more than 1TeraByte/sec thru I/O
6. Minimizing the hardware, so it can go into Galaxy 5,6…
82/83
NewTools: MaxSkins
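[Figure: MaxSkins tool flow. On the .max file developer side, the Dataflow Design (.maxj) and the Custom Engine Interfaces (.c) go through MaxCompiler to produce a .max file, followed by testing / application integration and the MaxCompiler App Packager. On the .max file user side, the App Installer enables SLiC level programming from MATLAB (.mex, .m), C/C++, R, Excel, and Python.]
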

83/83
Getting Started with Practical Work
from the Linux Shell
1. Open a shell terminal (e.g., $ /usr/bin/xfce4-terminal).
2. Connect to the Maxeler machine
(e.g., $ ssh root@147.91.12.216).
3. If more shell screens are needed, start screen (e.g., $ screen).
4. Switch to the directory that contains
the 2n+3 programs you wrote
(e.g., $ cd Desktop/workspace/src/ind/z88/).
5. Prepare your C code for measuring the execution time
(e.g., clock_gettime(CLOCK_REALTIME, &t2);).
6. See what you can do (e.g., $ make).
7. Select one of those that you can do
(e.g., $ make build-sim, $ make run-sim,
$ make build-hw, $ make run-hw).
8. Measure the power consumption at the wall plug.
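Step 5 can be prototyped outside the C code as well. The sketch below shows the same two-timestamp pattern using Python's wrapper around clock_gettime, which is available on Unix systems; the helper names and the measured workload are placeholders, and a fallback to time.time() is included as an assumption for platforms without that call.

```python
import time


def now() -> float:
    """Wall-clock timestamp in seconds, mirroring clock_gettime(CLOCK_REALTIME)."""
    if hasattr(time, "clock_gettime"):
        return time.clock_gettime(time.CLOCK_REALTIME)
    return time.time()  # fallback where clock_gettime is unavailable


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    t1 = now()
    result = func(*args)
    t2 = now()
    return result, t2 - t1


result, seconds = timed(sum, range(1_000_000))  # placeholder workload
print(result, seconds)
```

Taking the first timestamp immediately before the call and the second immediately after keeps setup and printing out of the measured interval.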
84/83
Q&A

vm@etf.rs

© H. Maurer

85
85/83

Más contenido relacionado

La actualidad más candente

GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)Fatima Qayyum
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoSciCompIIT
 
Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Slide_N
 
Supercomputer - Overview
Supercomputer - OverviewSupercomputer - Overview
Supercomputer - OverviewARINDAM ROY
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...Ganesan Narayanasamy
 
The Rise of Small Satellites
The Rise of Small SatellitesThe Rise of Small Satellites
The Rise of Small Satellitesmooctu9
 
Miniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellitesMiniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellitesX. Breogan COSTA
 
Effective machine learning_with_tpu
Effective machine learning_with_tpuEffective machine learning_with_tpu
Effective machine learning_with_tpuAthul Suresh
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIinside-BigData.com
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensOscar Law
 
Tesla personal super computer
Tesla personal super computerTesla personal super computer
Tesla personal super computerPriya Manik
 

La actualidad más candente (20)

Super Computer
Super ComputerSuper Computer
Super Computer
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
 
Supercomputers
SupercomputersSupercomputers
Supercomputers
 
Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!
 
Supercomputer - Overview
Supercomputer - OverviewSupercomputer - Overview
Supercomputer - Overview
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
 
The Rise of Small Satellites
The Rise of Small SatellitesThe Rise of Small Satellites
The Rise of Small Satellites
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
Nbvtalkatjntuvizianagaram
NbvtalkatjntuvizianagaramNbvtalkatjntuvizianagaram
Nbvtalkatjntuvizianagaram
 
Miniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellitesMiniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellites
 
supercomputer
supercomputersupercomputer
supercomputer
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Introduction to SLURM
Introduction to SLURMIntroduction to SLURM
Introduction to SLURM
 
Effective machine learning_with_tpu
Effective machine learning_with_tpuEffective machine learning_with_tpu
Effective machine learning_with_tpu
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware Challegens
 
Tesla personal super computer
Tesla personal super computerTesla personal super computer
Tesla personal super computer
 
Super computer 2017
Super computer 2017Super computer 2017
Super computer 2017
 

Similar a Anegdotic Maxeler (Romania)

Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010TELECOM I+D
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balasValentina Emilia Balas
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE
 
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...Codemotion
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...EUDAT
 
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontierinside-BigData.com
 
Semiconductor overview
Semiconductor overviewSemiconductor overview
Semiconductor overviewNabil Chouba
 
Energy Efficient Computing - 26mar13
Energy Efficient Computing - 26mar13Energy Efficient Computing - 26mar13
Energy Efficient Computing - 26mar13Ian Phillips
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceLEGATO project
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Ls catalog thiet bi tu dong master p 5000-e_dienhathe.vn
Ls catalog thiet bi tu dong master p 5000-e_dienhathe.vnLs catalog thiet bi tu dong master p 5000-e_dienhathe.vn
Ls catalog thiet bi tu dong master p 5000-e_dienhathe.vnDien Ha The
 
Ls catalog thiet bi tu dong master p 5000-e
Ls catalog thiet bi tu dong master p 5000-eLs catalog thiet bi tu dong master p 5000-e
Ls catalog thiet bi tu dong master p 5000-eDien Ha The
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the ContinuumIan Foster
 
Qcom XR Workshop Sept 2020
Qcom XR Workshop Sept 2020Qcom XR Workshop Sept 2020
Qcom XR Workshop Sept 2020Eiko Seidel
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 

Similar a Anegdotic Maxeler (Romania) (20)

Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balas
 
Anegdotic Maxeler (Romania)

  • 1. V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran (University of Belgrade); Oskar Mencer (Imperial College, London); Oliver Pell (Maxeler Technologies, London and Palo Alto); Michael Flynn (Stanford University, Palo Alto, USA); Valentina Balas (Aurel Vlaicu University of Arad, Romania; Maxeler Ambassador) 1/83
  • 2. An Alternative Title: How to hire more than 1000 PhD students at no additional cost to taxpayers?
  • 3. For Big Data algorithms, and for the same hardware price as before, achieving: a) speed-up of 20-200 times; b) monthly electricity bills reduced 20 times; c) size 20 times smaller. The major issues of engineering are design cost and design complexity. Remember, the economy has its own rules: production count and market demand! 3/83
  • 4. Elaboration :) If a computer center spends E50M/year on electricity bills and moves most of its time-consuming algorithms to Maxeler, which uses 20 times less power, the yearly spending drops to E2.5M, and E47.5M is saved for taxpayers :) If the average net salary of a PhD student in Germany is E1500 per month, and if the overhead factor is 1.00, it is easy to calculate that E47.5M can pay roughly 2,600 PhD students to work for one year, and that can go on year after year :) If the overhead factor is about 2.6 (I do not know how big it is, but it is less than that, for sure), one can hire 1000 PhD students at no additional cost :)
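The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes the E1500 salary is a monthly figure and that the overhead factor simply multiplies the salary cost; the function names are illustrative and not part of the original deck.

```python
def yearly_savings(electricity_bill, power_reduction_factor):
    """Money saved per year when power use drops by the given factor."""
    return electricity_bill - electricity_bill / power_reduction_factor


def fundable_students(savings, monthly_net_salary, overhead_factor=1.0):
    """Number of full-year PhD positions the savings can cover.

    Assumption: yearly cost = monthly net salary * 12 * overhead factor.
    """
    yearly_cost = monthly_net_salary * 12 * overhead_factor
    return savings / yearly_cost


def break_even_overhead(savings, monthly_net_salary, students):
    """Largest overhead factor at which `students` positions are still covered."""
    return savings / (monthly_net_salary * 12 * students)


savings = yearly_savings(50e6, 20)                 # slide inputs: E50M/year, 20x less power
print(savings)                                     # yearly savings in euros
print(fundable_students(savings, 1500))            # positions at overhead factor 1.00
print(break_even_overhead(savings, 1500, 1000))    # overhead factor for 1000 positions
```

With the slide's inputs, the savings cover on the order of 2,600 positions at an overhead factor of 1.00, and 1000 positions at an overhead factor of roughly 2.6.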
  • 5. 1. Over 95% of run time in loops [loops to almost zero]; 2. Reusability of data (e.g., x + x^2 + x^3 + x^4 + …) [how close to zero?]; 3. BigData [prog: for data streaming, not for data control]; 4. Latency; 5. A new programming model; 6. WORM [prog. effort + comp. time]. Use a tractor, not a Ferrari, to drive over a plowed field. 5/83
  • 6. Absolutely all results achieved in Europe: a) All hardware produced in Europe, specifically UK b) All software generated by programmers of EU and WB 6/83
  • 7. ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack (Japanese K, IBM Sequoia, Cray Titan, …). DataFlow: Coarse Grain (HEP) vs. Fine Grain (Maxeler). The history starts in the 1960s! The enabling technology did not exist before the year 2000! 7/83
  • 8. Compiling below the machine code level brings speedups, and also lower power, smaller size, and lower cost. The price to pay: the machine is more difficult to program. Consequently: ideal for WORM applications :) Examples using Maxeler: GeoPhysics (20-200), Banking (200-2000, with JP Morgan 20%), M&C (New York City), Datamining (Google), … 8/83
  • 10. 10/83
  • 11. Why Java? Minimal Kolmogorov Complexity, etc… 11/83
  • 12. 12
  • 13. 13
  • 14. $t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$; $t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$; $t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$. Assumptions: 1. Software includes enough parallelism to keep all cores busy. 2. The only limiting factor is the number of cores. 14/83
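The three execution-time formulas on this slide can be tried out numerically. The sketch below is an interpretation, not the authors' code: it assumes N is the number of data items, NOPS the operations per item, C the clock cycles per operation, Tclk the clock period, and NDF the number of parallel dataflow pipelines. The example machine parameters are hypothetical.

```python
def t_control_flow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """Slide formula for tCPU and tGPU: all work divided evenly over the cores."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def t_dataflow(n, n_ops, cycles_per_op, t_clk, n_pipes):
    """Slide formula for tDF: pipeline fill time plus one result per clock per pipeline."""
    fill = n_ops * cycles_per_op * t_clk          # latency until the first result appears
    stream = (n - 1) * t_clk / n_pipes            # remaining results stream out
    return fill + stream


# Hypothetical parameters, chosen only to illustrate the shape of the model.
n_items, ops_per_item = 10**9, 100
cpu = t_control_flow(n_items, ops_per_item, 1, 1 / 3e9, 8)      # 3 GHz, 8 cores
gpu = t_control_flow(n_items, ops_per_item, 1, 1 / 1e9, 1000)   # 1 GHz, 1000 cores
df = t_dataflow(n_items, ops_per_item, 1, 1 / 150e6, 8)         # 150 MHz, 8 pipelines
print(f"CPU {cpu:.3f} s, GPU {gpu:.3f} s, DataFlow {df:.3f} s")
```

The point of the model is visible in the formulas themselves: in the control-flow case the operation count multiplies the data size, while in the dataflow case it only contributes a one-time pipeline fill latency.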
  • 15. DualCore? Which way are the horses going? 15/83
  • 16. Is it possible to use 2000 chickens instead of two horses? What is better, real and anecdotal? 16/83
  • 17. 2 x 1000 chickens (CUDA and rCUDA) 17/83
  • 18. How about 2 000 000 ants? 18/83
  • 20. Factor: 20 to 200. MultiCore/ManyCore: Machine Level Code. Dataflow: Gate Transfer Level. 20/83
  • 22. Factor: 20. [Figure: MultiCore/ManyCore vs. DataFlow, each divided into Data Processing and Process Control.] 22/83
  • 23. MultiCore: explain what to do, to the driver; caches, instruction buffers, and predictors needed. ManyCore: explain what to do, to many sub-drivers; reduced caches and instruction buffers needed. DataFlow: make a field of processing gates (1C+2nJava+3Java); no caches, etc. (300 students/year: BGD, BCN, LjU, ICL, …) 23/83
  • 24. MultiCore: business as usual. ManyCore: more difficult. DataFlow: much more difficult; debugging both application and configuration code. 24/83
  • 25. MultiCore/ManyCore: several minutes. DataFlow: several hours for the real hardware. Fortunately, only several minutes for the simulator, several seconds for reload (90% due to DRAM inertia), and several milliseconds to restart. The simulator supports both the large JPMorgan machine and the smallest “University Support” machine. Good news: Tabula@2GHz. 25/83
  • 26. 26/83
  • 27. MultiCore: horse stable. ManyCore: chicken house. DataFlow: ant hole. 27/83
  • 29. Small Data: Toy Benchmarks (e.g., Linpack) 29/83
  • 32. Maxeler Hardware. CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM. DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers. Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections. MaxWorkstation: desktop development system. MaxCloud: on-demand scalable accelerated compute resource, hosted in London. 32/83
  • 33. Major Classes of Algorithms, from the Computational Perspective: 1. Coarse grained, stateful: Business (CPU requires DFE for minutes or hours; interrupts). 2. Fine grained, transactional with shared database: DM (CPU utilizes DFE for ms to s; many short computations, accessing common database data). 3. Fine grained, stateless transactional: Science (Phy, ...) (CPU requires DFE for ms to s; many short computations). 33/83
  • 34. Coarse Grained: Modeling. [Figures: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz); equivalent CPU cores vs. number of MAX2 cards (1, 4, 8) for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency.] • Long runtime, but: • Memory requirements change dramatically based on modelled frequency • Number of DFEs allocated to a CPU process can be easily varied to increase available memory • Streaming compression • Boundary data exchanged over chassis MaxRing 34/83
  • 35. Fine Grained, Shared Data: Monitoring • DFE DRAM contains the database to be searched • CPUs issue transactions find(x, db) • Complex search function – Text search against documents – Shortest distance to coordinate (multi-dimensional) – Smith Waterman sequence alignment for genomes • Any CPU runs on any DFE that has been loaded with the database – MaxelerOS may add or remove DFEs from the processing group to balance system demands – New DFEs must be loaded with the search DB before use 35/83
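To make the `find(x, db)` transaction on this slide concrete, here is a minimal CPU-side sketch in Python that uses a Smith-Waterman local alignment score as the search function. It is not Maxeler code: the linear gap penalty, the scoring values, and the function names are assumptions made for illustration, and a DFE implementation would stream the database from DRAM instead of looping in software.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def find(x, db):
    """Return (score, entry) for the database entry that aligns best with x."""
    return max((smith_waterman_score(x, entry), entry) for entry in db)


database = ["AAAA", "GTTGAC", "CCCC"]   # hypothetical toy database
print(find("GTTAC", database))
```

Each query is independent of every other query, which is why any CPU can issue it to any DFE that already holds the database.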
  • 36. Fine Grained, Stateless: The BSOP Control • Analyse > 1,000,000 scenarios • Many CPU processes run on many DFEs • ≈50x MPC-X vs. multi-core x86 node • Each transaction executes on any DFE in the assigned group atomically. [Figure: CPUs pass market and instruments data to DFEs; each DFE loops over instruments (random number generator and sampling of underliers; price instruments using Black Scholes); instrument values go back for tail analysis on CPU.] 36/83
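The per-scenario work named on this slide (sample the underliers, price with Black Scholes, analyse the tail on the CPU) can be sketched in plain Python. This is an illustrative software model only: the lognormal shock applied to the spot price, the parameter values, and the function names are assumptions, not the BSOP implementation.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (
        vol * math.sqrt(maturity)
    )
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def scenario_values(spot, strike, rate, vol, maturity, n_scenarios,
                    shock_vol=0.1, seed=0):
    """Price one instrument under randomly sampled underlier scenarios.

    Assumption: each scenario applies a lognormal shock to the spot price.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        shocked_spot = spot * math.exp(rng.gauss(0.0, shock_vol))
        values.append(black_scholes_call(shocked_spot, strike, rate, vol, maturity))
    return values


def tail_value(values, quantile=0.01):
    """Value at the given lower quantile (the CPU-side tail analysis step)."""
    ordered = sorted(values)
    return ordered[int(quantile * (len(ordered) - 1))]


values = scenario_values(100.0, 100.0, 0.05, 0.2, 1.0, n_scenarios=1000)
print(tail_value(values))
```

Because every scenario is priced independently, the loop body is the part that maps naturally onto a streaming pipeline, while the sort-based tail analysis stays on the CPU.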
  • 38. 38
  • 39. An MIS Example: Credit Derivatives
  • 41. 41
  • 42. Seismic Imaging • Running on MaxNode servers - 8 parallel compute pipelines per chip - 150MHz => low power consumption! - 30x faster than microprocessors. Source: An Implementation of the Acoustic Wave Equation on FPGAs, T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§ († Chevron, ‡ Maxeler, § formerly Chevron), SEG 2008 42/83
  • 43. The CRS Results: performance of one MAX2 card vs. 1 CPU core. Land case (8 params), speedup of 230x. Marine case (6 params), speedup of 190x. [Figure panels: CPU Coherency; MAX2 Coherency.] 43/83
  • 44. 44
  • 45.
  • 47. P. Marchetti et al, 2010. Trace Stacking: Speed-up 217 • DM for Monitoring and Control in Seismic processing • Velocity independent / data driven method to obtain a stack of traces, based on 8 parameters • Search for every sample of each output trace: $t_{hyp}^2 = \left(t_0 + \frac{2}{v_0} w^T m\right)^2 + \frac{2 t_0}{v_0}\left(m^T H_{zy} K_N H_{zy}^T m + h^T H_{zy} K_{NIP} H_{zy}^T h\right)$. Parameters: 2 parameters (emergence angle & azimuth); 3 Normal Wave front parameters (KN,11; KN,12; KN,22); 3 NIP Wave front parameters (KNip,11; KNip,12; KNip,22) 47/83
  • 48. Maxeler running Smith Waterman 48
  • 49.
  • 50. Molecular Correlates of Tumor Signatures from a Large Cohort From whole slide sections, of a cohort, to pathway analysis (Prof Bahram Parvin, Berkeley) High Content Analysis (HCA) on MPC-X
  • 51. 51
  • 52. Conclusion: Nota Bene This is about algorithmic changes, to maximize the algorithm to architecture match: algorithmic modifications, pipeline utilization, data choreography, and decision making precision. The winning paradigm of Big Data ExaScale? 52/83
  • 53. Algorithmic Changes: Data Dependencies. [Figure: dependency chains over PSI[0] … PSI[N-1], with operators OP and OP' and coefficients cbeta[0] … cbeta[N-2].] Example generated by Sasa Stojanovic (Gross-Pitaevskii) 53/83
  • 54. Pipeline Changes: Higher Efficiency. [Figure: pipeline diagram with inputs X[0,0], X[0,1], stages [7,0] … [0,0], and result R[0,0].] Example generated by Sasa Stojanovic (Gross-Pitaevskii) 54/83
  • 55. Data Rechoreography: Pipeline Utilization. Order of data accesses inside of a burst. Example generated by Sasa Stojanovic (Gross-Pitaevskii) 55/83
  • 56. Fixed Point: Savings Reinvestable • Consider fixed point compared to single precision floating point • If the range is tightly confined, one could use 24-bit fixed point • If data has a wider range, may need 32-bit fixed point • Add: 500 LUTs for hwFloat(8,24), 24 LUTs for hwFix(24,...), 32 LUTs for hwFix(32,...) • Multiply: 2 DSPs for hwFloat(8,24), 2 DSPs for hwFix(24,...), 4 DSPs for hwFix(32,...) • Arithmetic is not 100% of the chip. In practice, often ~5x performance boost from fixed point. 56
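The trade-off on this slide, between a tightly confined range in 24 bits and a wider range in 32 bits, can be explored with a small software model of signed fixed-point quantization. This is a hedged sketch: the split into total and fractional bits is a hypothetical parameterization chosen here, and it does not claim to reproduce the arguments elided as `...` in `hwFix(24,...)`.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize value to a signed fixed-point integer, saturating on overflow."""
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def from_fixed(raw, frac_bits):
    """Convert the fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


def quantization_error(value, total_bits, frac_bits):
    """Absolute error introduced by a round trip through fixed point."""
    return abs(value - from_fixed(to_fixed(value, total_bits, frac_bits), frac_bits))


# Hypothetical formats: 24 bits with 20 fractional bits vs. 32 bits with 20.
for sample in (0.1, 3.75, 100.0):
    print(sample,
          quantization_error(sample, 24, 20),
          quantization_error(sample, 32, 20))
```

In this model the 24-bit format saturates once the value leaves its narrow range, while the 32-bit format still represents it, which mirrors the slide's rule of thumb for choosing between the two widths.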
  • 57. Revisiting the Top 500 SuperComputers benchmarks: our paper in Communications of the ACM. Revisiting all major Big Data DM algorithms: massive static parallelism at low clock frequencies. Concurrency and communication: concurrency between millions of tiny cores is difficult; “jitter” between cores will harm performance at synchronization points. Reliability and fault tolerance: 10-100x fewer nodes, failures much less often. Memory bandwidth and FLOP/byte ratio: optimize data choreography, data movement, and the algorithmic computation. New architecture of n-Programming paradigms. 57/83
  • 58. FP7: RoMoL@BCN The SAB goal: Out of box thinking! 58/83
  • 59. FP7: BalCon@SRB The vision of Alkis Konstantellos The SAB goal: Seed for new proposals! 59/83
  • 61. DAFNE = South (MaxCode) + North (BigData). South: MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, IRB, QPLAN, Bogazici, U of Istanbul, U of Bucharest, U of Arad, U of Tuzla, Technion, Maxeler Israel, IPSI. North: UK, Sweden, Norway, Denmark, Germany, France, Austria, Switzerland, Poland, Hungary. 61/83
  • 63. The TriPeak @ DATAMAN: Siena + BSC + Imperial College + Maxeler + Belgrade 63/83
  • 64. The TriPeak: Essence. MontBlanc = a ManyCore (NVidia) + a MultiCore (ARM). Maxeler = a FineGrain DataFlow (FPGA). How about a happy marriage? MontBlanc (ompSS) and Maxeler (an accelerator). In each happy marriage, it is known who does what :) The Big Data DM algorithms: what part goes to MontBlanc and what to Maxeler? 64/83
  • 65. TriPeak: Core of the Symbiotic Success. An intelligent DM algorithmic scheduler, partially implemented for compile time, and partially for run time. At compile time: checking what part of code fits where (MontBlanc or Maxeler): LoC 1M vs 2K vs 20K. At run time: rechecking the compile time decision, based on the current data values. 65/83
  • 67. Maxeler: Research (Google: good method). Structure of a Typical Research Paper, Scenario #1 [Comparison of Platforms for One Algorithm]: Curve A: MultiCore of approximately the same PurchasePrice; Curve B: ManyCore of approximately the same PurchasePrice; Curve C: Maxeler after a direct algorithm migration; Curve D: Maxeler after algorithmic improvements; Curve E: Maxeler after data choreography; Curve F: Maxeler after precision modifications. Structure of a Typical Research Paper, Scenario #2 [Ranking of Algorithms for One Application]: CurveSet A: Comparison of Algorithms on a MultiCore; CurveSet B: Comparison of Algorithms on a ManyCore; CurveSet C: Comparison on Maxeler, after a direct algorithm migration; CurveSet D: Comparison on Maxeler, after algorithmic improvements; CurveSet E: Comparison on Maxeler, after data choreography; CurveSet F: Comparison on Maxeler, after precision modifications. 67/83
  • 68. Maxeler Research in Serbia: Special Issue of IPSI Transactions Journal. KG: Blood Flow, Tijana Djukic and Prof. Filipovic; NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic; MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic; ETF: Meteorology, Radomir Radojicic and Marko Stankovic; ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic; ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic. 68/83
  • 69. Maxeler Research WorldWide: Special Issue of Advances in Computers @ SCI. Stanford, Texas, Imperial, Maxeler, ETF, MF, MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, … 69/83
  • 71. Maxeler: Teaching (Google: prof vm) VLSI, PowerPoints, Maxeler: TEACHING, Maxeler Veljko Explanations, August 2012 Maxeler Veljko Anegdotic, Maxeler Oskar Talk, August 2012 Maxeler Forbes Article Flyer by JP Morgan Flyer by Maxeler HPC Tutorial Slides by Sasha and Veljko: Practice (Current Update) Paper, unconditionally accepted for Advances in Computers by Elsevier Paper, unconditionally accepted for Communications of the ACM Tutorial Slides by Oskar: Theory (7 parts) Slides by Jacob, New York Slides by Jacob, Alabama Slides by Sasha: Practice (Current Update) Maxeler in Meteorology Maxeler in Mathematics Examples generated in Belgrade and Worldwide THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example 71/83
  • 72. Maxeler PreConference Tutorials (2013) Google: IEEE HiPeak, Berlin, Germany, January 2013; ACM iSAC, Coimbra, Portugal, March 2013; IEEE MECO, Budva, Montenegro, June 2013; ACM ISCA, Tel Aviv, Israel, June 2013 72/83
  • 73. Maxeler InHouse Tutorials (2013) 73/83
  • 75. Maxeler University Program Members 75/83
  • 76. How to Become a Family Member? Options to consider: a. MAX-UP, free of charge. b. Purchasing a university-level machine (min about $10K). c. Purchasing a JPM-level machine (slowly approaching $100M), or at least a Schlumberger-level machine (slowly moving above $10M). 76/83
  • 77. Good to Know! Maxeler employs close to 100 people, GBR and USA: a. Maxeler cash burn per year = about $10M. b. If a university-level machine is sold at a 100% profit margin, the company life of Maxeler is extended for about 2 hours. c. If a university-level machine is sold at a 1% profit margin, the company life of Maxeler is extended for 1 minute. Our past or ongoing FP7 projects requiring Maxeler speeds: a. ProSense b. ARTreat c. HiPEAC 77/83
  • 78. The Educational Mission. Important note: a. Total number of accredited universities in the whole world? b. As per WeboMetrics, about 20000. c. Consequently, all universities of the world together bring only 20000 minutes of extra life, or about two weeks of extra life. The reality: a. University-level machines are sold at ZERO profit margin! b. Only the Xilinx costs, handling, and shipping. c. Email support for students doing a thesis is practically unlimited! Conclusion: This is a chance for those who jump in first :) 78/83
  • 79. Our Work Impacting Maxeler: Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact factor 2.205/2010). Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor 2.205/2010). Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227. Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040 (impact factor 1.822/2010). 79/83
  • 80. Maxeler Impacting Our Work: Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of Interdisciplinary Education On Technology-driven Application Design, IEEE Transactions on Education, August 2011, pp. 462-470 (impact factor 1.328/2010). Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S., Milutinovic, V., Fast File Existence Checking in Archiving Systems, ACM Transactions on Storage (TOS), Volume 7, Issue 1, June 2011, ACM New York, NY, USA. Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix Multiplication, IET Computers & Digital Techniques, 2012, 6, (4), pp. 249-256. Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec, R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to Petadata per Unit of Time and Power (On Sophisticated Benchmarks), Communications of the ACM, May 2013 (impact factor 1.919/2010). 80/83
  • 81. Current Main Efforts of Maxeler 1. To encourage a lot of software to be written/ported. This is a key business opportunity that needs to be developed. 2. Maxeler is building up a website and a community to share software for DFEs. This would allow the software to also be sold directly from the Maxeler website. 3. If a PhD student ports an important software to a Maxeler machine, she/he could become the first software vendor in the world for dataflow computers, and Maxeler would be happy to help sell licenses. 81/83
  • 82. Current Side Efforts of Maxeler: 1. Developing new tools that make kernels easier to write. 2. Bringing new languages to Maxeler: C, C++, MATLAB, Mathematica. 3. Porting popular application packages to Maxeler: OpenSees, etc. 4. Trying the Tabula FPGA! 5. Getting more than 1 TeraByte/sec through I/O. 6. Minimizing the hardware, so it can go into Galaxy 5, 6, … 82/83
  • 83. NewTools: MaxSkins. [Figure labels: Dataflow Design (.maxj); Custom Engine Interfaces (.c); MaxCompiler; .max file; Testing / Application integration; MaxCompiler App Packager; developer; user; App Installer; SLiC level programming; MATLAB (.mex, .m), C/C++, R, Excel, Python.] 83/83
  • 84. Getting Started: Practical Work from the Linux Shell. 1. Open a shell terminal (e.g., $ /usr/bin/xfce4-terminal). 2. Connect to the Maxeler machine (e.g., $ ssh root@147.91.12.216). 3. If more shell screens are needed, start screen (e.g., $ screen). 4. Switch to the directory that contains the 2n+3 programs you wrote (e.g., $ cd Desktop/workspace/src/ind/z88/). 5. Prepare your C code for measuring the execution time (e.g., clock_gettime(CLOCK_REALTIME, &t2);). 6. See what you can do (e.g., $ make). 7. Select one of those that you can do (e.g., $ make build-sim, $ make run-sim, $ make build-hw, $ make run-hw). 8. Measure the power consumption at the wall plug. 84/83
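Step 5 above takes timestamps around the code being measured. The same pattern is sketched below in Python, whose `time.clock_gettime` wraps the POSIX call on Unix systems. The sketch uses `CLOCK_MONOTONIC` by default, on the assumption that only the elapsed interval matters: unlike `CLOCK_REALTIME`, it is not affected by wall-clock adjustments. The helper name `timed` is illustrative.

```python
import time


def timed(fn, *args, clock=time.CLOCK_MONOTONIC):
    """Run fn(*args) and return (result, elapsed seconds). Unix only."""
    start = time.clock_gettime(clock)
    result = fn(*args)
    stop = time.clock_gettime(clock)
    return result, stop - start


result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={seconds:.6f} s")
```

To mirror the slide's C call exactly, pass `clock=time.CLOCK_REALTIME`; the measurement then follows the system's wall clock.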

Editor's notes

  1. Elastic makes things worse