Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Anegdotic Maxeler (Romania)
1. V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran
University of Belgrade
Oskar Mencer
Imperial College, London
Oliver Pell
Maxeler Technologies, London and Palo Alto
Michael Flynn
Stanford University, Palo Alto, USA
Valentina Balas
Aurel Vlaicu University of Arad, Romania, Maxeler Ambassador
1/83
3. For Big Data algorithms
and for the same hardware price as before,
achieving:
a) speed-up, 20-200
b) monthly electricity bills, reduced 20 times
c) size, 20 times smaller
The major issues of engineering are: design cost and design complexity.
Remember, economy has its own rules: production count and market demand!
3/83
4. Elaboration :)
If a computer center spends E50M/year on electricity bills,
and moves most of its time-consuming algorithms to Maxeler,
which uses 20 times less power,
the yearly spending drops down to E2.5M,
and E47.5M is saved for taxpayers :)
If the average net salary of a PhD student in Germany is E1500 per month,
and if the overhead factor is 1.00,
it is easy to calculate
that E47.5M can pay about 2638 PhD students to work for one year,
and that can go on year after year :)
If the overhead factor is 2.638
(I do not know how big it is, but it is less than 2.638, for sure),
one can hire 1000 PhD students, at no additional cost :)
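The arithmetic above can be sketched in a few lines. This is a minimal sketch that assumes the E1500 figure is a monthly net salary paid 12 times a year (the slide does not state the period); the function names and the overhead factor of 2.0 in the last call are hypothetical.

```python
# Worked arithmetic for the electricity-bill example.
# Assumption: E1500 is a monthly net salary, paid 12 times a year.

def yearly_savings(bill_before, power_reduction):
    """Savings when the yearly bill shrinks by the given power-reduction factor."""
    return bill_before - bill_before / power_reduction

def students_funded(savings, monthly_net_salary, overhead_factor=1.0):
    """Whole number of students the savings can pay for one year."""
    yearly_cost = monthly_net_salary * 12 * overhead_factor
    return int(savings // yearly_cost)

savings = yearly_savings(50_000_000, 20)      # E50M bill, 20 times less power
print(savings)                                # yearly savings in euros
print(students_funded(savings, 1500))         # overhead factor 1.00
print(students_funded(savings, 1500, 2.0))    # hypothetical overhead factor
```

Changing the overhead factor shows how quickly the headcount falls as the true cost per student rises above the net salary.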
5.
1. Over 95% of run time in loops [loops to almost zero]
2. Reusability of data (e.g., x+x^2+x^3+x^4+…) [how close to zero?]
3. BigData [prog: for data streaming, not for data control]
4. Latency
5. A new programming model
6. WORM [prog.effort+comp.time]
Use a tractor, not a Ferrari, to drive over a plowed field
5/83
6. Absolutely all results achieved in Europe:
a) All hardware produced in Europe,
specifically UK
b) All software generated by programmers
of EU and WB
6/83
7. ControlFlow (MultiFlow and ManyFlow):
Top500 ranks using Linpack
(Japanese K, IBM Sequoia, Cray Titan, …)
DataFlow:
Coarse Grain (HEP) vs. Fine Grain (Maxeler)
The history starts in the 1960s!
The enabler technology did not exist before the year 2000!
7/83
8. Compiling below the machine code level brings speedups,
as well as lower power, size, and cost.
The price to pay:
The machine is more difficult to program.
Consequently:
Ideal for WORM applications :)
Examples using Maxeler:
GeoPhysics (20-200), Banking (200-2000, with JP Morgan 20%),
M&C (New York City), Datamining (Google), …
8/83
14. t_CPU = N * N_OPS * C_CPU * T_clkCPU / N_coresCPU
t_GPU = N * N_OPS * C_GPU * T_clkGPU / N_coresGPU
t_DF = N_OPS * C_DF * T_clkDF + (N - 1) * T_clkDF / N_DF
Assumptions:
1. Software includes enough parallelism to keep all cores busy
2. The only limiting factor is the number of cores.
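The three execution-time expressions can be compared numerically. The sketch below is only an illustration of the model under the two stated assumptions; the function names and all sample values (clock rates, core counts, number of dataflow pipelines) are hypothetical and not taken from the slides.

```python
def t_controlflow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """CPU or GPU time: every one of the n data items pays for all n_ops operations."""
    return n * n_ops * cycles_per_op * t_clk / n_cores

def t_dataflow(n, n_ops, cycles_per_op, t_clk, n_df):
    """Dataflow time: pipeline fill latency once, then one result per clock per pipeline."""
    return n_ops * cycles_per_op * t_clk + (n - 1) * t_clk / n_df

# Hypothetical sample values, chosen only to exercise the formulas.
n, n_ops = 1_000_000, 100
cpu = t_controlflow(n, n_ops, cycles_per_op=1, t_clk=0.5e-9, n_cores=8)
gpu = t_controlflow(n, n_ops, cycles_per_op=1, t_clk=1.0e-9, n_cores=512)
df = t_dataflow(n, n_ops, cycles_per_op=1, t_clk=1 / 150e6, n_df=8)
print(cpu, gpu, df)
```

The point of the model is visible in the formulas themselves: in the control-flow case N_OPS multiplies N, while in the dataflow case N_OPS only appears in the one-off pipeline fill term.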
14/83
23. MultiCore:
Explain what to do, to the driver
Caches, instruction buffers, and predictors needed
ManyCore:
Explain what to do, to many sub-drivers
Reduced caches and instruction buffers needed
DataFlow:
Make a field of processing gates: 1C+2nJava+3Java
No caches, etc. (300 students/year: BGD, BCN, LjU, ICL,…)
23/83
24. MultiCore:
Business as usual
ManyCore:
More difficult
DataFlow:
Much more difficult
Debugging both application and configuration code
24/83
25. MultiCore/ManyCore:
Several minutes
DataFlow:
Several hours for the real hardware
Fortunately, only several minutes for the simulator,
several seconds for reload (90% due to DRAM inertia),
and several milliseconds to restart
The simulator supports
both the large JPMorgan machine
and the smallest “University Support” machine
Good news:
Tabula@2GHz
25/83
32. Maxeler Hardware
CPUs plus DFEs:
Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
DFEs shared over Infiniband:
Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
Low latency connectivity:
Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
MaxWorkstation:
Desktop development system
MaxCloud:
On-demand scalable accelerated compute resource, hosted in London
32/83
33. Major Classes of Algorithms,
from the Computational Perspective
1. Coarse grained, stateful: Business
– CPU requires DFE for minutes or hours
– Interrupts
2. Fine grained, transactional with shared database: DM
– CPU utilizes DFE for ms to s
– Many short computations, accessing common database data
3. Fine grained, stateless transactional: Science (Phy, ...)
– CPU requires DFE for ms to s
– Many short computations
33/83
34. Coarse Grained: Modeling
• Long runtime, but:
• Memory requirements change dramatically based on modelled frequency
• Number of DFEs allocated to a CPU process can be easily varied to increase available memory
• Streaming compression
• Boundary data exchanged over chassis MaxRing
[Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz)]
[Figure: equivalent CPU cores vs. number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]
34/83
35. Fine Grained, Shared Data: Monitoring
• DFE DRAM contains the database to be searched
• CPUs issue transactions find(x, db)
• Complex search function
– Text search against documents
– Shortest distance to coordinate (multi-dimensional)
– Smith Waterman sequence alignment for genomes
• Any CPU runs on any DFE
that has been loaded with the database
– MaxelerOS may add or remove DFEs
from the processing group to balance system demands
– New DFEs must be loaded with the search DB before use
35/83
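The find(x, db) transaction can be illustrated in software for one of the listed search functions, the shortest distance to a multi-dimensional coordinate. This is a plain CPU sketch, not DFE code; the return convention (index and distance) and the sample database are assumptions made for illustration.

```python
import math

def find(x, db):
    """Return (index, distance) of the database point closest to x.

    db is a sequence of coordinate tuples with the same dimension as x.
    """
    best_index, best_distance = -1, math.inf
    for index, point in enumerate(db):
        distance = math.dist(x, point)   # Euclidean distance, any dimension
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance

# Hypothetical database; on the real system it would reside in DFE DRAM.
db = [(3.0, 4.0), (1.0, 0.0), (10.0, 10.0)]
print(find((0.0, 0.0), db))
```

Because each call only reads the database, any number of such transactions can be served by any unit holding a copy of db, which is the property the slide relies on.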
36. Fine Grained, Stateless: The BSOP Control
• Analyse > 1,000,000 scenarios
• Many CPU processes run on many DFEs
• ≈50x MPC-X vs. multi-core x86 node
• Each transaction executes on any DFE in the assigned group atomically
[Figure: market and instruments data feed several CPU processes; each DFE loops over instruments, runs a random number generator and sampling of underliers, and prices instruments using Black Scholes; instrument values return to the CPUs for tail analysis]
36/83
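The per-DFE pipeline named on this slide (loop over instruments, random number generation and sampling of underliers, Black Scholes pricing, tail analysis on the CPU) can be sketched in software. This is a minimal sketch under strong assumptions: a single underlier following geometric Brownian motion, a portfolio of European calls described only by their strikes, and a simple empirical quantile as the tail measure. None of these choices come from the slides.

```python
import math
import random

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, t):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)

def scenario_values(strikes, spot0, rate, vol, horizon, t_left, n_scenarios, seed=0):
    """Portfolio value in each sampled scenario (the part done on the DFE)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        z = rng.gauss(0.0, 1.0)  # sample the underlier at the horizon
        spot = spot0 * math.exp((rate - 0.5 * vol * vol) * horizon
                                + vol * math.sqrt(horizon) * z)
        values.append(sum(black_scholes_call(spot, k, rate, vol, t_left)
                          for k in strikes))  # loop over instruments
    return values

def tail_quantile(values, q):
    """Empirical q-quantile of the scenario values (the part done on the CPU)."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

values = scenario_values([90.0, 100.0, 110.0], 100.0, 0.05, 0.2, 0.1, 0.9, 2000)
print(tail_quantile(values, 0.01))
```

Each scenario is independent of the others, so the loop body has no state to carry between transactions, matching the stateless class described on the slide.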
42. Seismic Imaging
• Running on MaxNode servers
- 8 parallel compute pipelines per chip
- 150MHz => low power consumption!
- 30x faster than microprocessors
An Implementation of the Acoustic Wave Equation on FPGAs
T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§
†Chevron, ‡Maxeler, §Formerly Chevron, SEG 2008
42/83
43. The CRS Results
Performance of one MAX2 card vs. 1 CPU core
Land case (8 params), speedup of 230x
Marine case (6 params), speedup of 190x
[Figure panels: CPU Coherency, MAX2 Coherency]
43/83
47. P. Marchetti et al, 2010
Trace Stacking: Speed-up 217
• DM for Monitoring and Control in Seismic processing
• Velocity independent / data driven method
to obtain a stack of traces, based on 8 parameters
• Search for every sample of each output trace
t_{hyp}^2 = \left(t_0 + \frac{2}{v_0} w^T m\right)^2 + \frac{2 t_0}{v_0} \left(m^T H_{zy} K_N H_{zy}^T m + h^T H_{zy} K_{NIP} H_{zy}^T h\right)
2 parameters (emergence angle & azimuth)
3 Normal Wave front parameters (KN,11; KN,12; KN,22)
3 NIP Wave front parameters (KNip,11; KNip,12; KNip,22)
47/83
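Evaluating the traveltime for one candidate parameter set is the inner step of the per-sample search. The sketch below assumes the usual hyperbolic 3D CRS form, t_hyp^2 = (t0 + (2/v0) w^T m)^2 + (2 t0/v0)(m^T H K_N H^T m + h^T H K_NIP H^T h), and takes the vector w directly instead of deriving it from the emergence angle and azimuth. All function names and sample values are hypothetical.

```python
import math

def quad_form(v, a):
    """Return v^T A v for a 2-vector v and a 2x2 matrix A (nested lists)."""
    return sum(v[i] * a[i][j] * v[j] for i in range(2) for j in range(2))

def transform(h_zy, k):
    """Return H K H^T for 2x2 matrices H and K."""
    hk = [[sum(h_zy[i][l] * k[l][j] for l in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(hk[i][l] * h_zy[j][l] for l in range(2)) for j in range(2)]
            for i in range(2)]

def crs_traveltime(t0, v0, w, m, h, h_zy, k_n, k_nip):
    """Hyperbolic CRS traveltime for midpoint displacement m and half-offset h."""
    linear = t0 + (2.0 / v0) * (w[0] * m[0] + w[1] * m[1])
    curvature = (2.0 * t0 / v0) * (quad_form(m, transform(h_zy, k_n))
                                   + quad_form(h, transform(h_zy, k_nip)))
    t_squared = linear * linear + curvature
    if t_squared < 0.0:
        raise ValueError("parameter set gives a negative squared traveltime")
    return math.sqrt(t_squared)

# Hypothetical values: identity transformation and isotropic curvatures.
eye = [[1.0, 0.0], [0.0, 1.0]]
k = [[0.001, 0.0], [0.0, 0.001]]
print(crs_traveltime(1.0, 2000.0, (0.0, 0.0), (0.0, 0.0), (100.0, 0.0), eye, k, k))
```

A full search would call this function for many parameter combinations per output sample and keep the combination with the highest coherency, which is where the large speed-up potential comes from.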
50. Molecular Correlates of Tumor Signatures
from a Large Cohort
From whole slide sections, of a cohort,
to pathway analysis (Prof Bahram Parvin,
Berkeley)
High Content Analysis (HCA) on MPC-X
52. Conclusion: Nota Bene
This is about algorithmic changes,
to maximize
the algorithm to architecture match:
algorithmic modifications,
pipeline utilization,
data choreography,
and
decision making precision.
The winning paradigm of Big Data ExaScale?
52/83
53. Algorithmic Changes: Data Dependencies
[Figure: data dependency graphs over PSI[0] … PSI[N-1] and cbeta[0] … cbeta[N-2], built from OP and OP' nodes]
Example generated by Sasa Stojanovic (Gross-Pitaevskii)
53/83
55. Data Rechoreography: Pipeline Utilization
Example generated by Sasa Stojanovic (Gross-Pitaevskii)
[Figure: order of data accesses inside of a burst]
55/83
56. Fixed Point: Savings Reinvestable
• Consider fixed point
compared to single precision floating point
• If the range is tightly confined,
one could use 24-bit fixed point
• If data has a wider range, may need 32-bit fixed point
            hwFloat(8,24)   hwFix(24,...)   hwFix(32,...)
Add         500 LUTs        24 LUTs         32 LUTs
Multiply    2 DSPs          2 DSPs          4 DSPs
• Arithmetic is not 100% of the chip.
In practice, often ~5x performance boost from fixed point.
56/83
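The precision side of this trade-off can be explored in software before committing hardware resources. The sketch below emulates signed fixed point with saturation; the split into total and fractional bits (here 24 and 20) is a hypothetical choice, since the slide leaves the second hwFix parameter unspecified.

```python
def to_fixed(x, total_bits, frac_bits):
    """Quantize x to a signed fixed-point integer, saturating at the range limits."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))

def from_fixed(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

# Hypothetical format: 24 bits in total, 20 of them fractional.
TOTAL_BITS, FRAC_BITS = 24, 20
x = 0.3
q = to_fixed(x, TOTAL_BITS, FRAC_BITS)
print(q, from_fixed(q, FRAC_BITS), abs(from_fixed(q, FRAC_BITS) - x))
```

Values inside the representable range are reproduced to within half a unit in the last fractional bit, while values outside it saturate, which is why the range of the data has to be known to be tightly confined before choosing the narrower type.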
57. Revisiting the Top 500 SuperComputers benchmarks
Our paper in Communications of the ACM
Revisiting all major Big Data DM algorithms
Massive static parallelism at low clock frequencies
Concurrency and communication
Concurrency between millions of tiny cores difficult,
“jitter” between cores will harm performance
at synchronization points
Reliability and fault tolerance
10-100x fewer nodes, failures much less often
Memory bandwidth and FLOP/byte ratio
Optimize data choreography, data movement,
and the algorithmic computation
New architecture of n-Programming paradigms
57/83
61. DAFNE = South (MaxCode) + North (BigData)
South (MaxCode):
MISANU, IMP, KG, NS,
BSC, UPV,
U of Siena, U of Roma,
IJS, FRI,
IRB,
QPLAN,
Bogazici, U of Istanbul,
U of Bucharest, U of Arad,
U of Tuzla,
Technion, Maxeler Israel, IPSI
North (BigData):
UK, Sweden, Norway, Denmark, Germany, France, Austria, Swiss, Poland, Hungary
61/83
64. The TriPeak: Essence
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)
How about a happy marriage?
MontBlanc (ompSS) and Maxeler (an accelerator)
In each happy marriage,
it is known who does what :)
The Big Data DM algorithms:
What part goes to MontBlanc and what to Maxeler?
64/83
65. TriPeak: Core of the Symbiotic Success
An intelligent DM algorithmic scheduler,
partially implemented for compile time,
and partially for run time.
At compile time:
Checking what part of code fits where
(MontBlanc or Maxeler): LoC 1M vs 2K vs 20K
At run time:
Rechecking the compile time decision,
based on the current data values.
65/83
67. Maxeler: Research (Google: good method)
Structure of a Typical Research Paper: Scenario #1
[Comparison of Platforms for One Algorithm]
Curve A: MultiCore of approximately the same PurchasePrice
Curve B: ManyCore of approximately the same PurchasePrice
Curve C: Maxeler after a direct algorithm migration
Curve D: Maxeler after algorithmic improvements
Curve E: Maxeler after data choreography
Curve F: Maxeler after precision modifications
Structure of a Typical Research Paper: Scenario #2
[Ranking of Algorithms for One Application]
CurveSet A: Comparison of Algorithms on a MultiCore
CurveSet B: Comparison of Algorithms on a ManyCore
CurveSet C: Comparison on Maxeler, after a direct algorithm migration
CurveSet D: Comparison on Maxeler, after algorithmic improvements
CurveSet E: Comparison on Maxeler, after data choreography
CurveSet F: Comparison on Maxeler, after precision modifications
67/83
68. Maxeler Research in Serbia:
Special Issue of IPSI Transactions Journal
KG: Blood Flow, Tijana Djukic and Prof. Filipovic
NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic
MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic
ETF: Meteorology, Radomir Radojicic and Marko Stankovic
ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic
ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic
68/83
69. Maxeler Research WorldWide:
Special Issue of Advances in Computers @ SCI
Stanford, Texas,
Imperial, Maxeler,
ETF, MF, MISANU, IMP, KG, NS,
BSC, UPV,
U of Siena, U of Roma,
IJS, FRI, …
69/83
71. Maxeler: Teaching (Google: prof vm)
VLSI, PowerPoints, Maxeler:
TEACHING,
Maxeler Veljko Explanations, August 2012
Maxeler Veljko Anegdotic,
Maxeler Oskar Talk, August 2012
Maxeler Forbes Article
Flyer by JP Morgan
Flyer by Maxeler HPC
Tutorial Slides by Sasha and Veljko: Practice (Current Update)
Paper, unconditionally accepted for Advances in Computers by Elsevier
Paper, unconditionally accepted for Communications of the ACM
Tutorial Slides by Oskar: Theory (7 parts)
Slides by Jacob, New York
Slides by Jacob, Alabama
Slides by Sasha: Practice (Current Update)
Maxeler in Meteorology
Maxeler in Mathematics
Examples generated in Belgrade and Worldwide
THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN,
with an example
71/83
72. Maxeler PreConference Tutorials (2013)
Google:
IEEE HiPEAC, Berlin, Germany, January 2013
ACM SAC, Coimbra, Portugal, March 2013
IEEE MECO, Budva, Montenegro, June 2013
ACM ISCA, Tel Aviv, Israel, June 2013
72/83
76. How to Become a Family Member?
Options to consider:
a. MAX-UP free of charge
b. Purchasing a university-level machine
(min about $10K)
c. Purchasing a JPM-level machine
(slowly approaching $100M),
or at least a Schlumberger-level machine
(slowly moving above $10M)
76/83
77. Good to Know!
Maxeler employs close to 100 people, GBR and USA:
a. Maxeler cash burn per year = about $10M
b. If a university-level machine is sold at the 100% profit margin,
the company life of Maxeler is extended for about 2 hours.
c. If a university-level machine is sold at the 1% profit margin,
the company life of Maxeler is extended for 1 minute.
Our past or ongoing FP7 projects requiring Maxeler speeds:
a. ProSense
b. ARTreat
c. HiPEAC
77/83
78. The Educational Mission
Important note:
a. Total number of accredited universities in the whole world?
b. As per Webometrics, about 20000.
c. Consequently, all universities of the world together bring only:
20000 minutes of extra life, or about two weeks of extra life.
The reality:
a. University-level machines are sold at the ZERO profit margin!
b. Only the Xilinx costs, handling, and shipping.
c. Email support for students doing their theses is practically unlimited!
Conclusion: This is a chance for those who jump in first :)
78/83
79. Our Work Impacting Maxeler
Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix
Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact
factor 2.205/2010).
Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple
Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor
2.205/2010).
Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network
Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS
ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227.
Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations
Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE
TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040
(impact factor 1.822/2010).
79/83
80. Maxeler Impacting Our Work
Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of
Interdisciplinary Education On Technology-driven Application Design IEEE
Transactions on Education, August 2011, pp.462-470. (impact factor
1.328/2010).
Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S.,
Milutinovic, V., Fast File Existence Checking in Archiving Systems, ACM
Transactions on Storage (TOS), Volume 7, Issue 1,
June 2011, ACM New York, NY, USA.
Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix
Multiplication, IET Computers & Digital Techniques, 2012, 6, (4), pp. 249-256.
Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec,
R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to
Petadata per Unit of Time and Power (On Sophisticated Benchmarks)
Communications of the ACM, May 2013 (impact factor 1.919/2010).
80/83
81. Current Main Efforts of Maxeler
1. To encourage a lot of software to be written/ported.
This is a key business opportunity that needs to be developed.
2. Maxeler is building up a website and a community
to share software for DFEs.
This would allow the software to also be sold
directly from the Maxeler website.
3. If a PhD student ports an important piece of software
to a Maxeler machine,
she/he could become the first software vendor in the world
for dataflow computers,
and Maxeler would be happy to help sell licenses.
81/83
82. Current Side Efforts of Maxeler
1. Developing new tools for easier making of kernels.
2. Bringing new languages to Maxeler:
C, C++, MATLAB, Mathematica
3. Porting popular application packages to Maxeler:
OpenSees, etc...
4. Trying the Tabula FPGA!
5. Getting more than 1TeraByte/sec thru I/O
6. Minimizing the hardware, so it can go into Galaxy 5,6…
82/83
84. Getting Started with Practical Work
from the Linux Shell
1. Open a shell terminal (e.g., $ /usr/bin/xfce4-terminal).
2. Connect to the Maxeler machine
(e.g., $ ssh root@147.91.12.216).
3. If more shell screens are needed, start screen (e.g., $ screen).
4. Switch to the directory that contains
the 2n+3 programs you wrote
(e.g., $ cd Desktop/workspace/src/ind/z88/).
5. Prepare your C code for measuring the execution time
(e.g., clock_gettime(CLOCK_REALTIME, &t2);).
6. See what you can do (e.g., $ make).
7. Select one of those that you can do
(e.g., $ make build-sim, $ make run-sim,
$ make build-hw, $ make run-hw).
8. Measure the power consumption at the wall plug.
84/83