Streams on wires
1. Streams on Wires — A Query Compiler for FPGAs
Rene Mueller, Jens Teubner, Gustavo Alonso
rene.mueller@inf.ethz.ch, jens.teubner@inf.ethz.ch, alonso@inf.ethz.ch
Systems Group, Department of Computer Science, ETH Zurich, Switzerland
2010.06.17
2. Abstract
Taking advantage of many-core, heterogeneous hardware for data processing tasks is a difficult problem. In this paper, we consider the use of FPGAs for data stream processing as co-processors in many-core architectures. We present Glacier, a component library and compositional compiler that transforms continuous queries into logic circuits by composing library components on an operator-level basis. In the paper we consider selection, aggregation, grouping, as well as windowing operators, and discuss their design as modular elements.
We also show how significant performance improvements can be achieved by inserting the FPGA into the system's data path (e.g., between the network interface and the host CPU). Our experiments show that queries on the FPGA can process streams at more than one million tuples per second and that they can do this directly from the network, removing much of the overhead of transferring the data to a conventional CPU.
The Avalanche project at ETH Zurich addresses the challenging questions raised by these scenarios. In Avalanche, we study the impact of modern computer architectures on data processing. As part of these efforts, we assess the potential of using FPGAs as additional cores in many-core machines and how they could be exploited for data processing purposes. In this paper, we report on the work we have done developing Glacier, a library of components and a basic compiler for continuous queries implemented on top of an FPGA. The ultimate goal of this line of work is to develop a hybrid data stream processing engine where an optimizer distributes query workloads across a set of CPUs (general-purpose or specialized) and FPGA chips.
3. Contributions
Here, we focus on how conventional streaming operators can be mapped to circuits on FPGAs; the proper system setup for the FPGA to operate in combination with an external CPU; and the actual performance that can be reached with the resulting system.
The paper makes the following contributions:
1. We describe Glacier, a component library and compiler for FPGA-based data stream processing. Besides classical streaming operators, Glacier includes specialized building blocks needed in the FPGA context. With the operators and specialized building blocks, we show how Glacier can be used to produce FPGA circuits that implement a wide variety of streaming queries.
2. Since FPGAs behave very differently than software, we provide an in-depth analysis of the complexity and performance of the resulting circuits. We discuss latency and issue rates as the relevant metrics that need to be considered by an optimizing plan generator.
3. Finally, we evaluate the end-to-end performance of an FPGA inserted between the network and the CPU, running a query compiled with Glacier. Our results show that the FPGA can process streams at a rate beyond one million tuples per second, far more than the CPU could.
4. Motivating Applications
In a joint project with a Swiss bank, we look at use cases in financial trading. Their trading application receives data from a set of streams with up-to-date market information from different stock exchanges. The streams exhibit high rates and pronounced load bursts, which typically indicate a turbulent market situation. At the same time, latency is critical and measured in units of microseconds (µs).
To abstract from the real application, we assume an input stream that contains a reduced set of information about each trade handled by Eurex (the actual streams are implemented as a compressed representation of the feature-rich FIX protocol [4]). Expressed in the syntax of StreamBase [16], the schema of our data would read:

CREATE INPUT STREAM Trades (
    Seqnr  int,        -- sequence number
    Symbol string(4),  -- valor symbol
    Price  int,        -- stock price
    Volume int)        -- trade volume

To keep matters simple, we look at queries that process a single data stream only. In particular, we disallow the use of joins. To facilitate the allocation of resources on the FPGA, we restrict ourselves to queries with a predictable space requirement. We do allow aggregation queries and windowing; in fact, we particularly look at such functionality in the second half of Section 4.
These restrictions can be lifted with techniques no different from those applied in software-based systems. The necessary FPGA circuitry, however, would introduce additional complexity and only distract from the FPGA-inherent considerations that are the main focus of this work.
5. Example Queries
Our first set of example queries is designed to illustrate a hardware-based implementation for the most basic operators in stream processing. Queries Q1 and Q2 use simple projections as well as selections and compound predicates:

SELECT Price, Volume
FROM Trades
WHERE Symbol = "UBSN"
INTO UBSTrades                                    (Q1)

SELECT Price, Volume
FROM Trades
WHERE Symbol = "UBSN" AND Volume > 100000
INTO LargeUBSTrades                               (Q2)

Financial analytics often depend on statistical information from the data stream. Using sliding-window and grouping functionality, Query Q3 counts the number of trades of UBS shares over the last 10 minutes (600 seconds) and returns the aggregate every minute:

SELECT count () AS Number
FROM Trades [ SIZE 600 ADVANCE 60 TIME ]
WHERE Symbol = "UBSN"
INTO NumUBSTrades                                 (Q3)

In Query Q4, we assume the presence of an aggregation function wsum that computes the weighted sum over the prices seen in the last four trades of UBS stocks (similar functionality is used, e.g., to implement finite-impulse response filters):

SELECT wsum (Price, [ .5, .25, .125, .125 ]) AS Wprice
FROM (SELECT * FROM Trades
      WHERE Symbol = "UBSN")
     [ SIZE 4 ADVANCE 1 TUPLES ]
INTO WeightedUBSTrades                            (Q4)

Finally, Query Q5 determines the average trade prices for each stock symbol over the last ten-minutes window:

SELECT Symbol, avg (Price) AS AvgPrice
FROM Trades [ SIZE 600 ADVANCE 60 TIME ]
GROUP BY Symbol
INTO PriceAverages                                (Q5)

We use these five queries in the following to demonstrate various features as well as the compositionality of the Glacier compiler.
6. Algebraic Plans
Table 1: Supported streaming algebra (a, b, c: field names; q, qi: sub-plans; x: parameterized sub-plan input).

  πa1,...,an (q)         projection
  σa (q)                 select tuples where field a contains true
  ○a:(b1,b2) (q)         arithmetic/Boolean operation a = b1 ⊕ b2
  q1 ∪ q2                union
  agg b:a (q)            aggregate agg using input field a,
                         agg ∈ {avg, count, max, min, sum}
  q1 grp x|c q2(x)       group output of q1 by field c, then invoke q2
                         with x substituted by the group
  q1 win(t) x|k,l q2(x)  sliding window with size k, advance by l;
                         apply q2 with x substituted on each window;
                         t ∈ {time, tuple}: time- or tuple-based
  q1 ⋈ q2                concatenation; position-based field join
                         ("join by position")
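The semantics of the tuple-based windowing operator in Table 1 can be sketched in software. This is an illustrative analogue only (the function `tuple_window` is our naming); Glacier compiles the operator to a circuit rather than executing it this way.

```python
# Software sketch of the tuple-based sliding-window operator from
# Table 1: a window of size k, advancing by l tuples, applying a
# parameterized sub-plan q2 to each closed window.
def tuple_window(stream, k, l, q2):
    stream = list(stream)
    results = []
    start = 0
    # A window closes once the k tuples [start, start+k) are all
    # available; the next window starts l tuples later.
    while start + k <= len(stream):
        results.append(q2(stream[start:start + k]))
        start += l
    return results

# Example: count() over windows of SIZE 4 ADVANCE 1 TUPLES, as in Q4.
counts = tuple_window(range(6), k=4, l=1, q2=len)
# Three full windows fit into six tuples: [0..3], [1..4], [2..5].
```

With l < k (as in Q4), consecutive windows overlap, so a single input tuple belongs to several windows at once; this is exactly what makes windowing the interesting case for the hardware implementation later in the paper.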
7. Algebraic Plans
Input to our compiler is a query representation in a simple algebra for streaming queries. Our compiler currently supports the algebra dialect listed in Table 1, whereby operators can be composed in an arbitrary fashion. Our algebra uses an "assembly-style" representation that breaks down selection, arithmetics, and predicate evaluation into separate algebra operators. In earlier work, we found a similar representation also suited to express, e.g., the semantics of XQuery [11]. In the context of the current work, this notation turns out to have nice correspondences to the data flow in a hardware circuit and helps detecting opportunities for parallel evaluation.

Figure 1: Algebraic query plans for the five example queries Q1 to Q5 (panels (a) through (e)).

Figure 1 illustrates how our streaming algebra can be used to express the semantics of Queries Q1 through Q5. Note how in Figure 1(a), e.g., operator ○= makes the comparison with the requested stock symbol explicit. Its output, the new column a, is used by σa to filter out non-qualifying tuples.
The concatenate operator ⋈ is also called a "join by position". Tuples from both inputs are combined into a wide result tuple in the order they arrive. The operator is necessary, for instance, to evaluate and return different aggregation functions over the same input stream.

FPGAs for Stream Processing
At its very heart, every FPGA chip consists of three main types of components. A large number of lookup tables (LUTs) provides a programmable type of logic gates: each lookup table can be programmed at runtime and thus implement an arbitrary 6 bit → 1 bit function. Lookup tables are wired through an interconnect network that can route signals across the chip. Finally, flip-flops (also called registers) provide 1-bit storage units that can directly be wired into the remaining logic.
In addition, block RAM components are distributed over the chip and tightly wired into the programmable logic, and lookup tables can also be used as memory. As such, the on-chip storage of the FPGA can be accessed in a truly parallel fashion. A particular use of this potential is the implementation of content-addressable memory (CAM). Other than traditional memory, a CAM is addressed by data values rather than by explicit addresses: given a piece of data, it resolves the address the data has been stored at.
The behavior of lookup tables, the wiring of the interconnect, and the initial state of the flip-flops can all be configured by software. Typically, the configuration is described using a hardware description language (such as VHDL or Verilog) and loaded into the FPGA.

Table 2: Xilinx XC5VLX110T characteristics.
  6-to-1 lookup tables          69,120
  flip-flops (1-bit registers)  69,120
  block RAM                     296 × 18 kbit
  25×18-bit multipliers         64
  typical clock rate            100 MHz
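The "assembly-style" decomposition can be made concrete with a software sketch of the plan for Q1 from Figure 1(a): the comparison (○=), the selection (σ), and the projection (π) are separate operators composed in a pipeline. The function names here are ours, purely for illustration; in Glacier each operator becomes a hardware component instead.

```python
# Software sketch of the assembly-style algebra applied to Query Q1:
#   π Price,Volume ( σa ( ○=a:(Symbol,"UBSN") (Trades) ) )
def op_eq(stream, a, b, const):
    """○=: extend each tuple with boolean field a = (tuple[b] == const)."""
    for t in stream:
        yield {**t, a: t[b] == const}

def select(stream, a):
    """σa: keep tuples whose field a contains true."""
    return (t for t in stream if t[a])

def project(stream, *fields):
    """π: keep only the named fields."""
    return ({f: t[f] for f in fields} for t in stream)

trades = [
    {"Symbol": "UBSN", "Price": 17950, "Volume": 250},
    {"Symbol": "NOVN", "Price": 54000, "Volume": 100},
]
q1 = list(project(select(op_eq(trades, "a", "Symbol", "UBSN"), "a"),
                  "Price", "Volume"))
```

Breaking the WHERE clause into an explicit comparison operator plus a selection on its boolean output is what gives the algebra its direct correspondence to wires and gates in the compiled circuit.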
8. FPGAs for Stream Processing
Figure 2: System architectures: (a) stream engine between network interface and CPU; (b) stream engine as a co-processor to the CPU.

In setup (a), the FPGA is directly connected to the physical network interface, with parts of the network controller implemented inside the FPGA fabric. The FPGA processes the incoming data stream, writes its results into system main memory, then informs the host CPU about the arrival of new data (e.g., by raising an interrupt). The architecture in Figure 2 (a) fits a pattern that is highly common in data stream applications.
Alternatively, the FPGA can also be used in a traditional co-processor setup, as illustrated in Figure 2 (b). Here, the CPU hands over data to the FPGA either by writing directly into FPGA registers (so-called slave registers) or by preparing the data in main memory for the FPGA to read.
9. Query Compilation
Figure 3: Compilation of abstract query plans into hardware circuits for the FPGA (algebraic query plan → Glacier → VHDL code → Xilinx synthesizer → circuit on FPGA).

The compiler applies compilation rules (Section 4) and optimization heuristics (Section 6), then emits the description of a logic circuit that implements the input plan. Every sub-plan adheres to the same well-defined wiring interface.

4.1 Wiring Interface
As in data streaming engines, our processing model is entirely push-based. Each n-bit-wide tuple is represented as a set of n parallel wires in the FPGA fabric. On a set of wires, a new tuple can be propagated in every cycle of the FPGA's system clock (i.e., 100 million tuples per second). An additional data valid line signals the presence of a tuple in a given clock cycle. Tuples are only considered part of the data stream if their data valid flag is set, i.e., if the data valid line carries an electrical "high".
10. Compilation Rules
Our circuits are all clock-driven (synchronized): every operator in our library writes its output into a flip-flop register after processing. In plan diagrams, we use boxes and arrows to denote the wiring between hardware components. Wherever appropriate, we identify those lines that correspond to a specific tuple field; the label * stands for "all remaining fields".
Two properties characterize every circuit: its latency (the number of clock cycles until an input tuple's effect becomes observable at the output, i.e., response time) and its issue rate (how often a new tuple can be accepted, which determines throughput). Both need to be considered during query compilation: all compilation rules ensure that a generated circuit will never try to push tuples in successive cycles into an operator that has an issue rate of less than one.

Projection (Rule Proj). πa1,...,an (q) compiles into a circuit that simply forwards the wires for fields a1, ..., an down the data path. This implementation has an interesting side effect: there is no actual work to do at runtime (though fields are propagated into a new set of registers). Our compiler emits the description of a hardware circuit that is passed into a synthesizer to generate the actual hardware configuration for the FPGA; the synthesizer optimizes out the "dangling wires" of discarded fields, effectively implementing projection for free. Latency and issue rate of this implementation are both 1.

Selection (Rule Sel). σa(q) compiles into a circuit in which a logical 'and' gate combines the data valid line with field a, invalidating the output tuple whenever a contains false. Latency and issue rate of the circuit generated for σa are both 1.

Arithmetics and Boolean operations (Rule Equals). As indicated in Table 1, we use the generic ○a:(b1,b2) operator to represent arithmetic computations, value comparisons, or Boolean connectives in relational plans. Rule Equals, for instance, compiles ○=a:(b1,b2)(q) into a circuit that emits all fields in q, extended by a new field a that contains the outcome of b1 = b2. Most simple arithmetic or Boolean operations run within a single clock cycle; more complex tasks, such as multiplication, division, or floating-point arithmetics [9], may require additional latency. The actual circuit that implements a component can often be tuned within the trade-offs between latency, issue rate, and chip space consumption. If the latency of a component is greater than one, delay operators z^-n are introduced to keep it synchronized with the remaining tuple flow; a delay operator z^-n blocks data items for a fixed number of cycles n and can be used, e.g., to properly synchronize the output of slow arithmetic operators with the remaining tuple flow.

Union. The implementation for ∪ needs to ensure proper synchronization between its input streams. We buffer all input ports using FIFOs, which act as short-term buffers for streams with varying data rates, and accumulate the output of several source streams into a single output stream in round-robin fashion (Figure 5 illustrates this for a 4-way union). In most practical cases, the depth of the FIFOs can be kept very low; this not only implies a small resource footprint, but also means that the impact on overall latency is typically small. Note that, at runtime, the average rate of all input streams together must not exceed the achievable output rate. Circuits that implement windowing semantics, in particular, cannot produce output before they have seen the last tuple of the respective query window.
11. Synchronization
& & & & of Though[9] ain orindividual input may feed may . require (Proj)
cation / division, a straightforward fashion. To implement
π 1 ,...,a floating-point arithmetics, into the union
state) every n (q) ⇒ ···
earlier how our assembly-style selection WPrice
rice WPrice
opera- coun t , e.g., we use Sometimes, counter component and wire
additional latency. a standard the actual acircuit that im-
a1 n ∗
WPriceneed WPriceconsidered dur- component at an arbitrary tuple rate (i.e., issue rate 1), the
operties sometimes also circuit. Compilation
to be plements can be tuned within the trade-offs latency, is-
Query Units
be cast into a hardware q
wsum eosFor instance, all compilation rules average rate of all the data valid signal of the input stream.
wsum eos wsum eos wsum
compilation. translation in the notation we also
its trigger input to input streams together must not exceed
ormalizes this Once weand chip spaceper cycle, whichthe latency of(a) emit
sue rate, reach consumption. If is
eos
ure that a of this&
remainder generated circuit will never try toto
work. We & the ⇒ &
use
more than one the endoperatorsπhave to is the maximum tu-
tuple of the current stream, we
symbol push the counter implementation for a1 ,...,an be introduced to
greaterTthan one, delay
his value to the upstream data pathan interesting side
has and
&
s in successive cycles and, as before, assume synchronize the operator
effect. to component the forward (b) reset
ple rate thatur compiler emits the description of a up-stream
the union output with can remaininghardware cir-
e “compiles to” relationinto an operator that has the counterO zero to prepare for the next input stream. fields
rate 0 labeled q 1is the We 1usethat results from
angleless than one. circuit two types CSR1 of logic the shownpath. In terms of 4.1.2).
(as data that is passed into latency, the state machine inside
before in Section a
1 Incuit translation rulesinglesynthesizer to generate the actual
the for coun t a (q), process. The FIFOs
nts logical ‘and’ gate synchronization between sub- cy- operator requires a
he to implement the & completes within a single the
q: hardware configuration for cycleFtoG A . T he synthesizer op-
the P
T herefore, the∗latency and the issue rate of the circuit the input, ith the rules we have either flip-flop can now or
at timizes W implementedwires”, seen so far, implementing pro-
Example. out “dangling using effectively we registers
rated for σa are both 1. eos a
block RAMpushdown fortrade-off), a hardware circuit. In cy-
translate our (a resource query into add another latency
jection first example free.
we the & σa
ere,σa& act ⇒ short-term buffers .for a1 ,...,a(Sel) discard igure Tthereillustrated the circuit to has at minimum(though fields
eues(q)use as pro jection operator π streams with cle.coun a (q) ⇒ 4, overall circuit therefore do a runtime (Count)
F The we is no actual work that results from applying of , latency
n to counter
rst
srying datatuple flow. Supportdatafieldarenaming (often Depending on theto Q uery Q1set of registers). L atency and
from the a rate. T heyaemit for at predictable 2. compilation rules into a new . distribution, however, the
∗ ∗ our are propagated input data ∗
essed usingthe issue rateqof an upstream sub-circuit. observed latency this implementation reflects tuples queue up 1.
, typically the π operator) is a straightforward ex ten- T he hardware circuit be higher whenever the shape of
a issue rate of may quite literally for pro jection are both
q
eof what we present the averagea:(Symbol,"UBSN")
that, at runtime, here. ○ input rate must not in an input FIFO. E ach of the operators can individually
= = the algebraic plan.
scarding is field"UBSN" leavestuple output rate. means to Strictly speaking,signal to the data validOperations
t the Symbol achievable the all tuples essentially
resulting circuit ∗
ed what a from with the flow simply we forwardArithmetics and Booleanand issue sufficient
4.3 Arithmetics and Boolean Operations

As indicated in Table 1, we use the generic ○a:(b1,b2) operator to represent arithmetic computations, value comparisons, or Boolean connectives in relational plans. The instance ○a:(b1=b2), e.g., will emit all fields in q, extended by a new field a that contains the outcome of the comparison b1 = b2:

    ○a:(b1=b2)(q) ⇒ compare fields b1 and b2 and feed the result into the new field a.  (Equals)

Most simple arithmetic or Boolean operations easily finish within a single clock cycle. Hence, latency and issue rate are both 1. More complex tasks, such as multiplication/division or floating-point arithmetics, may require pipelined execution over several cycles. To keep the output of such slow arithmetic units aligned with the remaining tuple flow, we use delay components z^-n: we simply delay data items for a fixed number of cycles n, e.g., to properly synchronize all fields of a tuple to a common clock cycle. A pipelined unit can still accept a new input at every clock cycle (the issue rate remains 1); only its latency grows.
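The z^-n component is just a chain of n registers. The following cycle-accurate toy simulation is our own illustration of that register-chain semantics; the two-stage multiplier it aligns against is a made-up example.

```python
from collections import deque

# Our simulation of a z^-n delay: a chain of n registers that outputs
# each input value exactly n clock cycles later (None = pipeline bubble).
class Delay:
    def __init__(self, n):
        self.regs = deque([None] * n, maxlen=n)

    def clock(self, value):
        out = self.regs[0]        # value that entered n cycles ago
        self.regs.append(value)   # shift the chain by one register
        return out

# Align a 2-cycle arithmetic unit with fields that need no arithmetic:
product = Delay(2)   # stands in for a 2-stage multiplier pipeline
passthru = Delay(2)  # z^-2 on the remaining tuple fields
for a, b in [(2, 3), (4, 5), (6, 7)]:
    aligned = (product.clock(a * b), passthru.clock((a, b)))
# once the pipeline has filled, both outputs belong to the same tuple
```

After the first two bubble cycles, the multiplier result and the delayed pass-through fields leave the circuit in the same cycle, which is exactly what the synchronization is for.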
4.4 Union

From a data flow point of view, the task of the algebraic union operator ∪ is to accumulate the output of several source streams into a single output stream. Since, in our case, all source streams operate truly in parallel, a hardware implementation for ∪ needs to ensure proper synchronization. We do so by buffering all input ports using FIFOs:

    q1 ∪ q2 ⇒ buffer each input in a FIFO and merge into one output stream.  (Union)

This requires that, at runtime, the average input rate does not exceed the issue rate of the circuit. Depending on the tuple distribution, however, the observed latency may be higher whenever tuples queue up in an input FIFO; the depth of the FIFOs is a trade-off between buffering capacity and (block RAM) resource consumption. The latency of a union is determined by its slowest sub-plan. The same construction generalizes to an n-way union and, as we will see in the following, the availability of an n-way union implementation eases the implementation of other functionality.
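The FIFO-based synchronization can be sketched behaviorally as follows. This is our own simulation, not Glacier code; the arbitration policy (serve the lowest-numbered non-empty FIFO first) and the AND-combination of end-of-stream flags are modeling choices of ours.

```python
from collections import deque

# Behavioral sketch of an n-way union: every input port is buffered in
# a FIFO; at most one tuple is emitted per clock cycle.
class UnionN:
    def __init__(self, n):
        # FIFO depth is a capacity vs. block-RAM trade-off in hardware;
        # we keep the queues unbounded here for simplicity.
        self.fifos = [deque() for _ in range(n)]
        self.eos = [False] * n

    def push(self, port, tup):
        self.fifos[port].append(tup)

    def mark_eos(self, port):
        self.eos[port] = True

    def clock(self):
        """Emit at most one buffered tuple per cycle (port 0 first)."""
        for fifo in self.fifos:
            if fifo:
                return fifo.popleft()
        return None                    # no valid tuple this cycle

    def done(self):
        # the union has ended only after ALL inputs have ended
        return all(self.eos) and not any(self.fifos)
```

The `done()` condition also shows why the average input rate must stay below the issue rate: otherwise the FIFOs never drain.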
4.5 Windowing

The concept of windowing bridges the gap between stream processing and relational-style semantics. The windowing operator x|k,l consumes the output of its left-hand sub-plan q1 and slices it into a set of windows. For each window, x|k,l invokes a parameterized sub-plan q2(x), with each occurrence of x replaced by the contents of the respective window. In our example, the window type is tuple-based: a 'counter' component observes incoming tuples and sends the adv signal as specified by the query's ADVANCE clause (in this particular case ADVANCE = 1, and we could have simplified the circuit by routing data valid directly to the adv line). Again, the 'and' gate easily finishes within a single clock cycle.

Holistic Aggregate Functions. For some aggregate functions, the state required is not within constant bounds. They need to batch (parts of) their input until the aggregate can be computed, e.g., when the end of the stream is seen. The prototype examples for such operators are the computation of medians or frequent items.

Our weighted sum operator wsum behaves similarly, but needs to remember only the last four input tuples. The use of flip-flops is a good choice to hold such small quantities of data. Here we use a shift register whose four stages always contain the last four input values, and the wsum computation itself runs within a single clock cycle.
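The shift-register buffering behind wsum can be mimicked like this. This is our sketch: the text only states that the last four inputs are held in flip-flops, so the weight vector below is a made-up example.

```python
from collections import deque

WEIGHTS = (0.4, 0.3, 0.2, 0.1)  # hypothetical coefficients, newest first

# Our model of the wsum buffer: a four-stage shift register that always
# contains the last four input values; the weighted sum itself is
# combinational and completes within the same cycle.
class WSum:
    def __init__(self):
        self.shift = deque([0.0] * 4, maxlen=4)

    def clock(self, value):
        self.shift.append(value)  # oldest value falls out of the register
        newest_first = reversed(self.shift)
        return sum(w * v for w, v in zip(WEIGHTS, newest_first))

w = WSum()
for price in (10.0, 10.0, 10.0, 10.0):
    out = w.clock(price)
# with all four stages holding 10.0 and weights summing to 1,
# the weighted sum is 10.0
```

Because the buffer has a fixed size of four, the state stays within constant bounds, unlike the holistic aggregates discussed above.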
5. AUXILIARY COMPONENTS
12. Examples
Figure 4: Compiled hardware execution plan for Query Q1. Latency of this circuit is 3, issue rate 1.

All sub-plans have an issue rate of one, so this is also the rate of the complete plan.

Figure 5: Compilation rule (Wind) for the windowing operator x|k,l (shown for an instance with at most three windows open in parallel).

The rule instantiates one copy of sub-plan q2 per window that can be open concurrently; the shift registers CSR1 and CSR2 in the figure select the currently active copies, and an n-way union collects their outputs. Each copy sees only the current window. Sub-plan q2 thus sees a finite input for every execution and may, e.g., use aggregation in a semantically sound manner. Our compiler implements this semantics by wrapping the copies of q2 accordingly.
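Figure 5's "at most three windows open in parallel" generalizes by simple arithmetic. The derivation below is ours (it follows from the standard size/advance window semantics, not from a formula in the text): a new window opens every l tuples and each stays open for k tuples, so at most ⌈k/l⌉ windows, and hence q2 copies, are needed.

```python
import math

# Number of q2 instances the compiler must provision for x|k,l
# (window SIZE k, ADVANCE l, tuple-based): one per concurrently
# open window, i.e. ceil(k / l).
def max_open_windows(k, l):
    return math.ceil(k / l)

# e.g. SIZE 6 ADVANCE 2 -> 3 parallel instances,
#      SIZE 600 ADVANCE 60 -> 10.
```

Since each instance is a physical copy of the q2 circuit, this count feeds directly into chip space consumption.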
13. Examples
Figure 6: Hardware execution plan for Query Q4 (the σa selection with ○a:(Symbol = "UBSN") on Trades feeds parallel wsum instances over WPrice, selected via CSR1/CSR2; a counter drives the adv signal, and an n-way union collects the results).

Algebraic Aggregate Functions. For the algebraic aggregates we consider, count, sum, avg, min, and max, the required state is within constant bounds [9], and each is straightforward to implement in hardware: the latency is one cycle, and a tuple can be applied at every clock cycle (the issue rate is 1). In the translation of count_a(q), e.g. (Rule (Count)), we forward the data valid signal of q to the trigger input of a counter component. The operator constructs a new output field without reading any particular input value and emits no field but the aggregate a (we handle grouping separately, see next). Once we reach the end of the stream, we emit the counter value and reset the counter to zero.
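The (Count) translation reduces to a counter wired to the data valid line. A behavioral equivalent (our sketch, mirroring the description above):

```python
# Behavioral model of the (Count) rule: the data valid signal of q
# drives the trigger input of a counter; at end-of-stream the counter
# value is emitted as the only output field a and the counter resets.
class CountAggregate:
    def __init__(self):
        self.counter = 0

    def clock(self, data_valid, eos):
        if data_valid:
            self.counter += 1          # trigger input
        if eos:
            a, self.counter = self.counter, 0
            return {"a": a}            # emit exactly one result tuple
        return None                    # no valid output this cycle
```

One tuple can be applied per call, matching the stated issue rate of 1 and latency of one cycle.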
14. Examples
Figure 7: Compilation rule (GrpBy) to implement the 'group by' operator grp.

The rule instantiates one copy of sub-plan q2 per group. A content-addressable memory (CAM) translates the value of the grouping field of each incoming tuple into the address of the q2 instance responsible for that group, a demultiplexer (DEMUX) routes the tuple there, and an n-way union collects the outputs of all instances. The CAM causes a noticeable effect on the overall chip space consumption.
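The CAM-plus-DEMUX routing can be sketched as follows. This is our simulation: the CAM becomes a dict, and each q2 instance is stood in for by a simple per-group tuple count (any aggregate sub-plan would do).

```python
# Sketch of the (GrpBy) routing: a CAM maps the value of the grouping
# field to the address of the q2 instance handling that group; a DEMUX
# forwards the tuple to that instance.
class GroupByRouter:
    def __init__(self, max_groups):
        self.cam = {}                 # grouping value -> instance address
        self.instances = []           # one q2 state per group
        self.max_groups = max_groups  # CAM size is fixed at synthesis time

    def clock(self, group_value):
        if group_value not in self.cam:
            if len(self.instances) >= self.max_groups:
                raise RuntimeError("CAM full: group count bounded by chip space")
            self.cam[group_value] = len(self.instances)
            self.instances.append(0)
        addr = self.cam[group_value]  # CAM lookup
        self.instances[addr] += 1     # DEMUX routes the tuple to q2 #addr

    def results(self):
        return {g: self.instances[a] for g, a in self.cam.items()}
```

The fixed `max_groups` reflects the hardware reality: the number of q2 copies, and the CAM size, must be chosen at synthesis time, which is where the chip space cost comes from.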
15. Streaming De-Multiplexing

If data volumes become high, the data is written into (external) memory, where logic on the FPGA picks it up autonomously after it has received a work request from the host CPU. If no suitable hardware component exists, or an operator's state is not within a constant bound (such as a holistic aggregate), the plan optimizer can also decide to run (part of) the plan on the host CPU; it would be straightforward to add this idea to the plan generator of our actual implementation.

5.3 Stream De-Multiplexing

Actual implementations may depend on specialized functionality that would be inefficient to express using standard algebra components. In our use case, algorithmic trading, input data is received as a multiplexed stream, encoded in a compressed variant of the FIX protocol [4]. Expressed using the StreamBase syntax, the multiplexed stream contains actual streams like:

CREATE INPUT STREAM NewOrderStream (
    MsgType byte,       -- 68: new order
    ClOrdId int,        -- unique order identifier
    OrdType char,       -- 1:market, 2:limit, 3:stop
    Side char,          -- 1:buy, 2:sell, 3:buy minus
    TransactTime long)  -- UTC Timestamp

CREATE INPUT STREAM OrderCancelRequestStream (
    MsgType byte,       -- 70: order cancel request
    ClOrdId int,        -- unique order identifier
    OrigClOrdId int,    -- previous order
    Side char,          -- 1:buy, 2:sell, 3:buy minus
    TransactTime long)  -- UTC Timestamp

We have implemented a stream de-multiplexer that interprets the MsgType field (first field in every stream) and dispatches the tuple to the proper plan part.
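A toy software analogue of that dispatch step (ours, not the FPGA implementation): the table mirrors the comments in the stream declarations above, where MsgType 68 marks a new order and 70 an order cancel request.

```python
# Route multiplexed tuples to their stream by the MsgType field,
# which is the first field of every tuple.
DISPATCH = {
    68: "NewOrderStream",
    70: "OrderCancelRequestStream",
}

def demux(raw_tuples):
    """Dispatch each tuple to the plan part for its stream."""
    routed = {name: [] for name in DISPATCH.values()}
    for tup in raw_tuples:
        stream = DISPATCH.get(tup[0])   # interpret the MsgType field
        if stream is not None:          # skip unknown message types
            routed[stream].append(tup)
    return routed
```

In hardware the same decision is a comparison on the first field of the tuple, which is why it fits naturally in front of the compiled plan parts.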
17. Evaluation
Figure 8: Number of packages successfully processed for two given input loads (data input rates of 300,000 pkt/s and 1,000,000 pkt/s; FPGA vs. software on Linux 2.6). The hardware implementation is able to sustain both package rates, while the software implementation processes only part of the packets (down to 36 %) under high load.

An issue rate of 1 means that the circuit can process 100 million input tuples per second.
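The "100 million input tuples per second" figure is just the issue rate multiplied by the clock frequency. The quick check below assumes a 100 MHz FPGA clock, which is consistent with the stated number but not quoted in this excerpt.

```python
# Throughput = clock frequency x issue rate (tuples accepted per cycle).
def tuples_per_second(clock_hz, issue_rate):
    return clock_hz * issue_rate

# At an assumed 100 MHz clock, an issue rate of 1 sustains 1e8 tuples/s.
print(tuples_per_second(100_000_000, 1))  # -> 100000000
```

This also shows why issue rate, not latency, governs sustained throughput: a deep pipeline with issue rate 1 processes just as many tuples per second as a shallow one.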
18. Related Work
Database Processing Systems as Hardware Implementations
"database machine" [3]: tailor-made hardware for database processing
Netezza Performance Server system [2]: SPUs combine a disk, a NIC, a CPU, and an FPGA
SQL Chip [14]
In Avalanche, we aim at exploiting the configurability of FPGAs. With Glacier, we present a compiler that compiles queries from an arbitrary workload into a tailor-made hardware circuit.
Processing Model
our processing model resembles that of the MonetDB/X100 system [13]
Specialized processors for use in a database context
GPUTeraSort [8]
stream joins for the Cell processor [6]