Anegdotic Maxeler (Romania)

V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran
University of Belgrade

Oskar Mencer
Imperial College, London

Oliver Pell
Maxeler Technologies, London and Palo Alto

Michael Flynn
Stanford University, Palo Alto, USA

Valentina Balas
Aurel Vlaicu University of Arad, Romania, Maxeler Ambassador

1/83

An Alternative Title:
How to hire
more than 1000 PhD students
at no additional cost
to taxpayers?

For Big Data algorithms
and for the same hardware price as before,
achieving:
a) speed-up, 20-200
b) monthly electricity bills, reduced 20 times
c) size, 20 times smaller
The major issues of engineering are: design cost and design complexity.
Remember, economy has its own rules: production count and market demand!
3/83

Elaboration :)
If a computer center spends E50M/year on electricity bills,
and moves most of its time-consuming algorithms to Maxeler,
which uses 20 times less power,
the yearly spending drops down to E2.5M,
and E47.5M is saved for taxpayers :)
If the average net salary of a PhD student in Germany is E1500 per month,
and if the overhead factor is 1.00,
it is easy to calculate
that E47.5M can pay 2611 PhD students to work for one year,
and that can go on year after year :)
If the overhead factor is 2.611
(I do not know how big it is, but it is less than 2.611, for sure),
one can hire 1000 PhD students, at no additional cost :)
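
The arithmetic above can be replayed in a few lines. This is only a sketch: the function name is hypothetical, the inputs are the slide's own illustrative figures, and a 12-month salary year is assumed. With these inputs the quotient comes out near 2639 student-years, slightly above the quoted 2611 (which corresponds to a saving of about E47M); either way, more than 1000 students remain fundable at an overhead factor of 2.611.

```python
# Replays the slide's back-of-the-envelope calculation.
# All numbers are the slide's illustrative inputs; the function name is hypothetical.

def phd_students_funded(yearly_bill_eur, power_reduction,
                        net_salary_eur_per_month, overhead_factor):
    """Return (yearly saving in EUR, PhD student-years that saving pays for)."""
    new_bill = yearly_bill_eur / power_reduction
    saving = yearly_bill_eur - new_bill
    # Assumption: 12 salary months per year, scaled by the overhead factor.
    cost_per_student_year = net_salary_eur_per_month * 12 * overhead_factor
    return saving, saving / cost_per_student_year


saving, students = phd_students_funded(50e6, 20, 1500, 1.0)
print(f"saving: E{saving / 1e6:.1f}M, student-years: {students:.0f}")

_, students_with_overhead = phd_students_funded(50e6, 20, 1500, 2.611)
print(f"student-years at overhead 2.611: {students_with_overhead:.0f}")
```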

1. Over 95% of run time in loops [loops to almost zero]
2. Reusability of data (e.g., x + x^2 + x^3 + x^4 + …) [how close to zero?]
3. BigData [prog: for data streaming, not for data control]
4. Latency
5. A new programming model
6. WORM [prog. effort + comp. time]
Use a tractor, not a Ferrari, to drive over a plowed field
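
The data-reusability point can be made concrete with the power-series example from the list. The sketch below (function name hypothetical) forms each power from the previous one, so every term costs one multiply and one add, which is the kind of reuse a pipeline can exploit.

```python
def power_series_sum(x, n_terms):
    """Sum x + x^2 + ... + x^n_terms, reusing the previous power for each term."""
    total = 0.0
    power = 1.0
    for _ in range(n_terms):
        power *= x      # reuse: x^k is built from x^(k-1)
        total += power
    return total


print(power_series_sum(2.0, 4))  # 2 + 4 + 8 + 16
```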

5/83

Absolutely all results achieved in Europe:
a) All hardware produced in Europe,
specifically UK
b) All software generated by programmers
of EU and WB

6/83

ControlFlow (MultiFlow and ManyFlow):
• Top500 ranks using Linpack (Japanese K, IBM Sequoia, Cray Titan, …)

DataFlow:
• Coarse Grain (HEP) vs. Fine Grain (Maxeler)
The history starts in the 1960s!
The enabling technology did not exist before the year 2000!

7/83

Compiling below the machine code level brings speedups,
and also lower power, size, and cost.
The price to pay:
The machine is more difficult to program.
Consequently:
Ideal for WORM applications :)
Examples using Maxeler:
GeoPhysics (20-200), Banking (200-2000, with JP Morgan 20%),
M&C (New York City), Datamining (Google), …
8/83

Simulator builder
Hardware builder
9
2n+3

Why Java? Minimal Kolmogorov Complexity, etc…

11/83

t_CPU = N * N_OPS * C_CPU * T_clkCPU / N_coresCPU

t_GPU = N * N_OPS * C_GPU * T_clkGPU / N_coresGPU

t_DF = N_OPS * C_DF * T_clkDF + (N - 1) * T_clkDF / N_DF

Assumptions:
1. Software includes enough parallelism to keep all cores busy
2. The only limiting factor is the number of cores.
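
The three timing formulas can be evaluated side by side. The sketch below reads N as the number of data items, NOPS as operations per item, C as cycles per operation, Tclk as the clock period, and NDF as the number of parallel dataflow pipelines; that reading, and every sample value, is an assumption made for illustration and not a measurement.

```python
def t_cpu(n, n_ops, c_cpu, t_clk_cpu, n_cores_cpu):
    """Control-flow time on a MultiCore: all work divided over the cores."""
    return n * n_ops * c_cpu * t_clk_cpu / n_cores_cpu


def t_gpu(n, n_ops, c_gpu, t_clk_gpu, n_cores_gpu):
    """Same model for a ManyCore, with its own clock and core count."""
    return n * n_ops * c_gpu * t_clk_gpu / n_cores_gpu


def t_df(n, n_ops, c_df, t_clk_df, n_df):
    """Dataflow time: pipeline fill once, then one result per clock per pipeline."""
    return n_ops * c_df * t_clk_df + (n - 1) * t_clk_df / n_df


# Hypothetical sample values, chosen only to exercise the formulas.
n, n_ops = 10**9, 100
print("CPU:", t_cpu(n, n_ops, 1, 1 / 3e9, 8))
print("GPU:", t_gpu(n, n_ops, 1, 1 / 1e9, 512))
print("DF: ", t_df(n, n_ops, 1, 1 / 200e6, 10))
```

Note that NOPS multiplies N in the control-flow formulas but appears only in the one-off fill term of the dataflow formula, which is why long streams favor the dataflow machine despite its slower clock.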

14/83

DualCore?

Which way are the horses going?

15/83

Is it possible to use 2000 chickens instead of two horses?

What is better, real and anecdotic?
16/83

2 x 1000 chickens (CUDA and rCUDA)
17/83

How about 2 000 000 ants?
18/83

Big Data Input

Results

Marmalade

19/83

Factor: 20 to 200
MultiCore/ManyCore

Dataflow

Machine Level Code

Gate Transfer Level
20/83

Factor: 20
MultiCore/ManyCore

Dataflow

21/83

Factor: 20
MultiCore/ManyCore

DataFlow

Data Processing
Process Control

22/83

• MultiCore:
  – Explain what to do, to the driver
  – Caches, instruction buffers, and predictors needed

• ManyCore:
  – Explain what to do, to many sub-drivers
  – Reduced caches and instruction buffers needed

• DataFlow:
  – Make a field of processing gates: 1C+2nJava+3Java
  – No caches, etc. (300 students/year: BGD, BCN, LjU, ICL,…)
23/83

MultiCore:
• Business as usual

ManyCore:
• More difficult

DataFlow:
• Much more difficult
• Debugging both application and configuration code

24/83

• MultiCore/ManyCore:
  – Several minutes

• DataFlow:
  – Several hours for the real hardware
  – Fortunately, only several minutes for the simulator,
    several seconds for reload (90% due to DRAM inertia),
    and several milliseconds to restart
  – The simulator supports both the large JPMorgan machine
    and the smallest “University Support” machine
  – Good news: Tabula@2GHz
25/83

MultiCore:
• Horse stable

ManyCore:
• Chicken house

DataFlow:
• Ant hole

27/83

MultiCore:
• Haystack

ManyCore:
• Cornbits

DataFlow:
• Crumbs

28/83

Small Data: Toy Benchmarks (e.g., Linpack)
29/83

Medium Data
(benchmarks favoring NVidia, compared to Intel,…)

30/83

Maxeler Hardware

CPUs plus DFEs
Intel Xeon CPU cores and up to
4 DFEs with 192GB of RAM

DFEs shared over Infiniband
Up to 8 DFEs with 384GB of
RAM and dynamic allocation
of DFEs to CPU servers

MaxWorkstation
Desktop development system

32/83

Low latency connectivity
Intel Xeon CPUs and 1-2 DFEs
with up to six 10Gbit Ethernet
connections

MaxCloud
On-demand scalable accelerated
compute resource, hosted in London

Major Classes of Algorithms,
from the Computational Perspective
1. Coarse grained, stateful: Business
– CPU requires DFE for minutes or hours
– Interrupts

2. Fine grained, transactional with shared database: DM
– CPU utilizes DFE for ms to s
– Many short computations, accessing common database data

3. Fine grained, stateless transactional: Science (Phy, ...)
– CPU requires DFE for ms to s
– Many short computations

33/83

Coarse Grained: Modeling

34/83

[Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz)]

[Figure: equivalent CPU cores vs. number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]

• Long runtime, but:
• Memory requirements change dramatically based on modelled frequency
• Number of DFEs allocated to a CPU process can be easily varied to increase available memory
• Streaming compression
• Boundary data exchanged over chassis MaxRing

Fine Grained, Shared Data: Monitoring
• DFE DRAM contains the database to be searched
• CPUs issue transactions find(x, db)
• Complex search function
– Text search against documents
– Shortest distance to coordinate (multi-dimensional)
– Smith Waterman sequence alignment for genomes

• Any CPU runs on any DFE
that has been loaded with the database
– MaxelerOS may add or remove DFEs
from the processing group to balance system demands
– New DFEs must be loaded with the search DB before use
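
As an illustration of a find(x, db) transaction, the sketch below implements the "shortest distance to coordinate" search in plain Python. It only shows the shape of the computation; the in-memory list standing in for the DFE DRAM database and the sample points are assumptions, and nothing here is Maxeler code.

```python
import math


def find(x, db):
    """Return (index, distance) of the point in db nearest to coordinate x."""
    best_index, best_dist = -1, math.inf
    for i, point in enumerate(db):      # one pass over the stored database
        d = math.dist(x, point)         # Euclidean distance, any dimension
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


# Hypothetical 3-dimensional database; each call is one independent transaction.
db = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0), (5.0, 5.0, 5.0)]
print(find((1.0, 1.0, 1.0), db))
```

Because each transaction only reads the database, any unit holding a copy can serve it, which matches the requirement that new DFEs be loaded with the search DB before use.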
35/83

Fine Grained, Stateless: The BSOP Control
• Analyse > 1,000,000 scenarios
• Many CPU processes run on many DFEs
• ≈50x MPC-X vs. multi-core x86 node
• Each transaction executes on any DFE in the assigned group atomically

[Figure: CPUs pass market and instruments data to DFEs; each DFE loops over instruments, with a random number generator and sampling of underliers followed by pricing of instruments using Black Scholes; instrument values go back to the CPUs for tail analysis on CPU]
36/83
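
The stages named on this slide (random number generation and sampling of underliers, pricing instruments with Black Scholes, tail analysis on the CPU) can be sketched in plain Python. Everything specific here is an assumption for illustration: a single underlier following a lognormal model, European call options as the instruments, a simple quantile as the tail measure, and all function names and sample values.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (
        vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def scenario_values(instruments, spot, rate, vol, horizon, n_scenarios, seed=0):
    """Sample the underlier per scenario, then loop over instruments to price them."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        z = rng.gauss(0.0, 1.0)  # random number generation
        shocked = spot * math.exp((rate - 0.5 * vol * vol) * horizon
                                  + vol * math.sqrt(horizon) * z)  # sampled underlier
        values.append(sum(black_scholes_call(shocked, strike, rate, vol, maturity)
                          for strike, maturity in instruments))
    return values


def tail_quantile(values, level):
    """Tail analysis step: the portfolio value at a given lower quantile."""
    ordered = sorted(values)
    return ordered[int(level * (len(ordered) - 1))]


# Hypothetical portfolio of (strike, maturity) pairs.
portfolio = [(90.0, 1.0), (100.0, 1.0), (110.0, 2.0)]
values = scenario_values(portfolio, 100.0, 0.05, 0.2, 10 / 252, 10_000)
print("1% tail value:", tail_quantile(values, 0.01))
```

Each scenario is independent of the others, which is what makes the workload stateless and lets any transaction run on any DFE in the group.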

Selected Examples:
Business,
Mathematics,
GeoPhysics, etc.
37/83

An MIS Example: Credit Derivatives

Orbital station

Climber

Tether

HW

Seismic Imaging

• Running on MaxNode servers
- 8 parallel compute pipelines per chip
- 150MHz => low power consumption!
- 30x faster than microprocessors

An Implementation of the Acoustic Wave Equation on FPGAs
T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§
†Chevron, ‡Maxeler, §Formerly Chevron, SEG 2008
42/83

The CRS Results
• Performance of one MAX2 card vs. 1 CPU core
  – Land case (8 params), speedup of 230x
  – Marine case (6 params), speedup of 190x

[Figure: CPU Coherency vs. MAX2 Coherency]
P. Marchetti et al, 2010
43/83

Trace Stacking: Speed-up 217
• DM for Monitoring and Control in Seismic processing
• Velocity independent / data driven method
to obtain a stack of traces, based on 8 parameters
• Search for every sample of each output trace
$$
t_{hyp}^2 = \left( t_0 + \frac{2}{v_0}\, \mathbf{w}^T \mathbf{m} \right)^2
          + \frac{2 t_0}{v_0} \left( \mathbf{m}^T H_{zy} K_N H_{zy}^T \mathbf{m}
          + \mathbf{h}^T H_{zy} K_{NIP} H_{zy}^T \mathbf{h} \right)
$$

2 parameters (emergence angle & azimuth)
3 Normal Wave front parameters (KN,11; KN,12; KN,22)
3 NIP Wave front parameters (KNip,11; KNip,12; KNip,22)
47/83
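
The traveltime expression on this slide can be evaluated for one candidate parameter set as sketched below. The shapes are assumptions not stated on the slide: w, m, and h are treated as 2-vectors and H_zy, K_N, K_NIP as 2x2 matrices, and w is taken as an input, since the slide does not spell out how it is built from the emergence angle and azimuth. The function names are hypothetical.

```python
def quad_form(v, h_zy, k):
    """Evaluate v^T H K H^T v for a 2-vector v and 2x2 matrices H and K."""
    # u = H^T v
    u = (h_zy[0][0] * v[0] + h_zy[1][0] * v[1],
         h_zy[0][1] * v[0] + h_zy[1][1] * v[1])
    # ku = K u
    ku = (k[0][0] * u[0] + k[0][1] * u[1],
          k[1][0] * u[0] + k[1][1] * u[1])
    return u[0] * ku[0] + u[1] * ku[1]


def crs_traveltime_squared(t0, v0, w, m, h, h_zy, k_n, k_nip):
    """Squared hyperbolic traveltime for midpoint offset m and half-offset h."""
    linear = t0 + (2.0 / v0) * (w[0] * m[0] + w[1] * m[1])
    curvature = quad_form(m, h_zy, k_n) + quad_form(h, h_zy, k_nip)
    return linear ** 2 + (2.0 * t0 / v0) * curvature


# Hypothetical inputs: identity matrices and a unit midpoint displacement.
identity = ((1.0, 0.0), (0.0, 1.0))
print(crs_traveltime_squared(1.0, 2.0, (0.0, 0.0), (1.0, 0.0), (0.0, 0.0),
                             identity, identity, identity))
```

The search described above repeats this evaluation for every sample of each output trace and every candidate parameter set, which is why the workload suits a deep pipeline.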

Maxeler running Smith Waterman

48

Molecular Correlates of Tumor Signatures
from a Large Cohort
From whole slide sections, of a cohort,
to pathway analysis (Prof Bahram Parvin,
Berkeley)

High Content Analysis (HCA) on MPC-X

Conclusion: Nota Bene
This is about algorithmic changes,
to maximize
the algorithm to architecture match:
algorithmic modifications,
pipeline utilization,
data choreography,
and
decision making precision.
The winning paradigm of Big Data ExaScale?

52/83

Algorithmic Changes: Data Dependencies
[Figure: data-dependency graphs over PSI[0] … PSI[N-1] and cbeta[0] … cbeta[N-2], built from OP and OP' nodes]

Example generated by Sasa Stojanovic (Gross-Pitaevskii)

53/83

Pipeline Changes: Higher Efficiency
[Figure: pipeline with inputs X[0,0], X[0,1], …, stages holding elements [7,0] down to [0,0], and output R[0,0]]

54/83

Data Choreography: Pipeline Utilization

Order of data accesses
inside of a burst

55/83

Fixed Point: Savings Reinvestable
• Consider fixed point
compared to single precision floating point
• If the range is tightly confined,
one could use 24-bit fixed point
• If data has a wider range, may need 32-bit fixed point
            hwFloat(8,24)   hwFix(24,...)   hwFix(32,...)
Add         500 LUTs        24 LUTs         32 LUTs
Multiply    2 DSPs          2 DSPs          4 DSPs

• Arithmetic is not 100% of the chip.
In practice, often ~5x performance boost from fixed point.
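
To show what confining data to a fixed-point range means in practice, the sketch below quantizes a value to a signed fixed-point code and back. The split between integer and fractional bits (16 fractional bits here) is an assumption chosen for illustration; the slide leaves that parameter elided, and the function names are hypothetical and not part of any Maxeler API.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize value to a signed fixed-point integer code, saturating on overflow."""
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def from_fixed(code, frac_bits):
    """Convert a fixed-point integer code back to a float."""
    return code / (1 << frac_bits)


for bits in (24, 32):
    code = to_fixed(0.1, bits, 16)
    error = abs(from_fixed(code, 16) - 0.1)
    print(f"{bits}-bit code for 0.1: {code}, round-trip error: {error:.2e}")

# A value outside the representable range saturates instead of wrapping.
print(from_fixed(to_fixed(1e9, 24, 16), 16))
```

With a tightly confined range the narrower format loses nothing, which is the case where the freed LUTs can be reinvested in more pipelines.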
56/83

• Revisiting the Top 500 SuperComputers benchmarks
  – Our paper in Communications of the ACM
• Revisiting all major Big Data DM algorithms
  – Massive static parallelism at low clock frequencies
• Concurrency and communication
  – Concurrency between millions of tiny cores is difficult;
    “jitter” between cores will harm performance at synchronization points
• Reliability and fault tolerance
  – 10-100x fewer nodes, failures much less often
• Memory bandwidth and FLOP/byte ratio
  – Optimize data choreography, data movement, and the algorithmic computation
• New architecture of n-Programming paradigms
57/83

FP7: RoMoL@BCN

The SAB goal: Out of box thinking!
58/83

FP7: BalCon@SRB

The vision of Alkis Konstantellos

The SAB goal: Seed for new proposals!
59/83

DAFNE = South (MaxCode) + North (BigData)
South: MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, IRB, QPLAN, Bogazici, U of Istanbul, U of Bucharest, U of Arad, U of Tuzla, Technion, Maxeler Israel, IPSI
North: UK, Sweden, Norway, Denmark, Germany, France, Austria, Switzerland, Poland, Hungary
61/83

The TriPeak @ DATAMAN

Siena
+ BSC
+ Imperial College
+ Maxeler
+ Belgrade

63/83

The TriPeak: Essence
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)
How about a happy marriage?
MontBlanc (ompSS) and Maxeler (an accelerator)
In each happy marriage,
it is known who does what :)
The Big Data DM algorithms:
What part goes to MontBlanc and what to Maxeler?
64/83

TriPeak: Core of the Symbiotic Success
An intelligent DM algorithmic scheduler,
partially implemented for compile time,
and partially for run time.
At compile time:
Checking what part of code fits where
(MontBlanc or Maxeler): LoC 1M vs 2K vs 20K
At run time:
Rechecking the compile time decision,
based on the current data values.
65/83

Maxeler: Research (Google: good method)

Structure of a Typical Research Paper: Scenario #1
[Comparison of Platforms for One Algorithm]
Curve A: MultiCore of approximately the same PurchasePrice
Curve B: ManyCore of approximately the same PurchasePrice
Curve C: Maxeler after a direct algorithm migration
Curve D: Maxeler after algorithmic improvements
Curve E: Maxeler after data choreography
Curve F: Maxeler after precision modifications

Structure of a Typical Research Paper: Scenario #2
[Ranking of Algorithms for One Application]
CurveSet A: Comparison of Algorithms on a MultiCore
CurveSet B: Comparison of Algorithms on a ManyCore
CurveSet C: Comparison on Maxeler, after a direct algorithm migration
CurveSet D: Comparison on Maxeler, after algorithmic improvements
CurveSet E: Comparison on Maxeler, after data choreography
CurveSet F: Comparison on Maxeler, after precision modifications

67/83

Maxeler Research in Serbia:
Special Issue of IPSI Transactions Journal
KG: Blood Flow, Tijana Djukic and Prof. Filipovic

NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic
MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic
ETF: Meteorology, Radomir Radojicic and Marko Stankovic
ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic
ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic
68/83

Maxeler Research WorldWide:
Special Issue of Advances in Computers @ SCI

Stanford, Texas,
Imperial, Maxeler,
ETF, MF, MISANU, IMP, KG, NS,
BSC, UPV,
U of Siena, U of Roma,
IJS, FRI, …

69/83

Maxeler: Teaching (Google: prof vm)
TEACHING, VLSI, PowerPoints, Maxeler:
Maxeler Veljko Explanations, August 2012
Maxeler Veljko Anegdotic,
Maxeler Oskar Talk, August 2012
Maxeler Forbes Article
Flyer by JP Morgan
Flyer by Maxeler HPC
Tutorial Slides by Sasha and Veljko: Practice (Current Update)
Paper, unconditionally accepted for Advances in Computers by Elsevier
Paper, unconditionally accepted for Communications of the ACM
Tutorial Slides by Oskar: Theory (7 parts)
Slides by Jacob, New York
Slides by Jacob, Alabama
Slides by Sasha: Practice (Current Update)
Maxeler in Meteorology
Maxeler in Mathematics
Examples generated in Belgrade and Worldwide
THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN,
with an example
71/83

Maxeler PreConference Tutorials (2013)
Google:
IEEE HiPEAC, Berlin, Germany, January 2013
ACM iSAC, Coimbra, Portugal, March 2013
IEEE MECO, Budva, Montenegro, June 2013
ACM ISCA, Tel Aviv, Israel, June 2013

72/83

Maxeler InHouse Tutorials (2013)

73/83

Maxeler University Program Members

75/83

How to Become a Family Member?
Options to consider:
a. MAX-UP free of charge
b. Purchasing a university-level machine
(min about $10K)
c. Purchasing a JPM-level machine
(slowly approaching $100M),
or at least a Schlumberger-level machine
(slowly moving above $10M)
76/83

Good to Know!

Maxeler employs close to 100 people, GBR and USA:
a. Maxeler cash burn per year = about $10M
b. If a university-level machine is sold at a 100% profit margin,
the company life of Maxeler is extended by about 2 hours.
c. If a university-level machine is sold at a 1% profit margin,
the company life of Maxeler is extended by 1 minute.
Our past or ongoing FP7 projects requiring Maxeler speeds:
a. ProSense
b. ARTreat
c. HiPEAC

77/83

The Educational Mission
Important note:
a. Total number of accredited universities in the whole world?
b. As per WeboMetrics, about 20000.
c. Consequently, all universities of the world together bring only:
20000 minutes of extra life, or about two weeks of extra life.
The reality:
a. University-level machines are sold at ZERO profit margin!
b. Only the Xilinx costs, handling, and shipping are charged.
c. Email support for students doing a thesis is practically unlimited!
Conclusion: This is a chance for those who jump in first :)
78/83

Our Work Impacting Maxeler
Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix
Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact
factor 2.205/2010).
Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple
Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor
2.205/2010).
Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network
Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS
ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227.
Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations
Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE
TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040
(impact factor 1.822/2010).
79/83

Maxeler Impacting Our Work
Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of
Interdisciplinary Education On Technology-driven Application Design, IEEE
Transactions on Education, August 2011, pp. 462-470 (impact factor
1.328/2010).
Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S.,
Milutinovic, V., Fast File Existence Checking in Archiving Systems, ACM
Transactions on Storage (TOS), Volume 7, Issue 1,
June 2011, ACM New York, NY, USA.
Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix
Multiplication, IET Computers & Digital Techniques, 2012, 6, (4), pp. 249-256.
Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec,
R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to
Petadata per Unit of Time and Power (On Sophisticated Benchmarks),
Communications of the ACM, May 2013 (impact factor 1.919/2010).
80/83

Current Main Efforts of Maxeler
1. To encourage a lot of software to be written/ported.
This is a key business opportunity that needs to be developed.
2. Maxeler is building up a website and a community
to share software for DFEs.
This would allow the software to also be sold
directly from the Maxeler website.
3. If a PhD student ports an important piece of software
to a Maxeler machine,
she/he could become the first software vendor in the world
for dataflow computers,
and Maxeler would be happy to help sell licenses.
81/83

Current Side Efforts of Maxeler
1. Developing new tools that make kernels easier to write.
2. Bringing new languages to Maxeler:
C, C++, MATLAB, Mathematica
3. Porting popular application packages to Maxeler:
OpenSees, etc...
4. Trying the Tabula FPGA!
5. Getting more than 1 TeraByte/sec through I/O
6. Minimizing the hardware, so it can go into Galaxy 5,6…
82/83

NewTools: MaxSkins
[Figure: MaxSkins tool flow. .max file developer: Dataflow Design (.maxj), Custom Engine Interfaces (.c), MaxCompiler, .max file, Testing / Application integration, MaxCompiler App Packager. .max file user: App Installer, SLiC level programming from MATLAB (.mex, .m), C/C++, R, Excel, Python]
83/83

Getting Started with Practical Work
from the Linux Shell
1. Open a shell terminal (e.g., $ /usr/bin/xfce4-terminal).
2. Connect to the Maxeler machine
(e.g., $ ssh root@147.91.12.216).
3. If more shell screens are needed, start screen (e.g., $ screen).
4. Switch to the directory that contains
the 2n+3 programs you wrote
(e.g., $ cd Desktop/workspace/src/ind/z88/).
5. Prepare your C code for measuring the execution time
(e.g., clock_gettime(CLOCK_REALTIME, &t2);).
6. See what you can do (e.g., $ make).
7. Select one of those that you can do
(e.g., $ make build-sim, $ make run-sim,
$ make build-hw, $ make run-hw).
8. Measure the power consumption at the wall plug.
84/83
