SlideShare una empresa de Scribd logo
1 de 18
Streams on Wires — A Query Compiler for FPGAs

           Rene Mueller                              Jens Teubner                         Gustavo Alonso
      rene.mueller@inf.ethz.ch                 jens.teubner@inf.ethz.ch                 alonso@inf.ethz.ch
                     Systems Group, Department of Computer Science, ETH Zurich, Switzerland




ABSTRACT                                                           The Avalanche project at ETH Zurich addresses the chal-
Taking advantage of many-core, heterogeneous hardware for       lenging questions raised by these scenarios. In Avalanche,
data processing tasks is a difficult problem. In this paper, we   we study the impact of modern computer architectures on
                                                                data processing. As part of the efforts around Avalanche,
                                                        :
consider the use of FPGAs for data stream processing as co-
processors in many-core architectures. We present Glacier,      we assess the potential of using FPGAs as additional cores
a component library and compositional compiler that trans-      in many-core machines and how they could be exploited for
                                                      2010.06.17
forms continuous queries into logic circuits by composing
library components on an operator-level basis. In the pa-
                                                                data processing purposes.
                                                                   In this paper, we report on the work we have done devel-
per we consider selection, aggregation, grouping, as well as    oping Glacier, a library of components and a basic compiler
windowing operators, and discuss their design as modular        for continuous queries implemented on top of an FPGA.
elements.                                                       The ultimate goal of this line of work is to develop a hy-
   We also show how significant performance improvements         brid data stream processing engine where an optimizer dis-
can be achieved by inserting the FPGA into the system’s         tributes query workloads across a set of CPUs (general-
data path (e.g., between the network interface and the host     purpose or specialized) and FPGA chips. In here, we focus
CPU). Our experiments show that queries on the FPGA             on how conventional streaming operators can be mapped to
Abstract
ABSTRACT                                                           The Ava
Taking advantage of many-core, heterogeneous hardware for       lenging qu
data processing tasks is a difficult problem. In this paper, we   we study
consider the use of FPGAs for data stream processing as co-     data proc
processors in many-core architectures. We present Glacier,      we assess
a component library and compositional compiler that trans-      in many-c
forms continuous queries into logic circuits by composing       data proce
library components on an operator-level basis. In the pa-          In this p
per we consider selection, aggregation, grouping, as well as    oping Glac
windowing operators, and discuss their design as modular        for contin
elements.                                                       The ultim
   We also show how significant performance improvements         brid data
can be achieved by inserting the FPGA into the system’s         tributes q
data path (e.g., between the network interface and the host     purpose or
CPU). Our experiments show that queries on the FPGA             on how co
can process streams at more than one million tuples per         circuits on
second and that they can do this directly from the network,     over data
removing much of the overhead of transferring the data to       to operate
a conventional CPU.                                             actual per
                                                                system.
                                                                   The pap
uples per
 network,     over data streams; the proper system setup for the FPGA
e data to
                           Contributions
              to operate in combination with an external CPU; and the
              actual performance that can be reached with the resulting
              system.
                The paper makes the following contributions:

              1. We describe Glacier, a component library and compiler
  rds many-
                 for FPGA-based data stream processing. Besides clas-
 pportuni-
                 sical streaming operators, Glacier includes specialized
  rocessing
                 building blocks needed in the FPGA context. With the
 t go away
                 operators and specialized building blocks, we show how
 r the I/O
                 Glacier can be used to produce FPGA circuits that im-
 ow to ex-
                 plement a wide variety of streaming queries.
ow to deal
host com-     2. Since FPGAs behave very differently than software, we
 machines        provide an in-depth analysis of the complexity and per-
   provides      formance of the resulting circuits. We discuss latency
PUs) next        and issue rates as the relevant metrics that need to be
 dors have       considered by an optimizing plan generator.
e machine
e floating     3. Finally, we evaluate the end-to-end performance of an
  de larger      FPGA inserted between the network and the CPU, run-
  be CPUs        ning a query compiled with Glacier. Our results show
                 that the FPGA can process streams at a rate beyond one
                 million tuples per second, far more than the CPU could.

ed provided     Our work is organized as follows. The upcoming Sec-
which typically indicate a turbulent market situation. At        Table 1: Supported streaming a

                     Motivating Applications
the same time, latency is critical and measured in units of      names; q, qi : sub-plans; x: para
microseconds (µs).                                               input).
   To abstract from the real application, we assume an input
stream that contains a reduced set of information about each
trade handled by Eurex (the actual streams are implemented       of UBS stocks (similar functionality
  with a Swiss bank
as a compressed representation of the feature-rich FIX pro-      plement finite-impulse response filter
tocol [4]). financial trading application receives data from a setdetermines the average trade prices
     Their Expressed in the syntax of StreamBase [16], the        of streams with up-to-date
     market information from different stock exchanges.
schema of our data would read:                                   over the last ten-minutes window:
        CREATE INPUT STREAM Trades (                               SELECT count () AS Number
          Seqnr int,              -- sequence number                 FROM Trades [ SIZE 600 ADVANC
          Symbol string(4), -- valor symbol                         WHERE Symbol = "UBSN"
          Price int,              -- stock price                     INTO NumUBSTrades
          Volume int)             -- trade volume
                                                                   SELECT wsum (Price, [ .5, .25, .12
   To keep matters simple, we look at queries that process           FROM (SELECT * FROM Trades
a single data stream only. In particular, we disallow the                     WHERE Symbol = "UBSN")
use of joins. To facilitate the allocation of resources on the             [ SIZE 4 ADVANCE 1 TUPLE
FPGA, we restrict ourselves to queries with a predictable            INTO WeightedUBSTrades
space requirement. We do allow aggregation queries and             SELECT Symbol, avg (Price) AS Av
windowing; in fact, we particularly look at such functionality       FROM Trades [ SIZE 600 ADVANC
in the second half of Section 4.                                    GROUP BY Symbol
   These restrictions can be lifted with techniques that are         INTO PriceAverages
no different to those applied in software-based systems. The         We use these five queries in the fol
necessary FPGA circuitry, however, would introduce addi-         various features as well as the composi
tional complexity and only distract from the FPGA-inherent       compiler.
considerations that are the main focus of this work.
                                                                 2.3 Algebraic Plans
2.2 Example Queries                                                 Input to our compiler is a query rep
1   x|k,l   2                                                            Input to our compiler is a
  Our first set of example queries substituted on each wind.;
                    apply q2 with x is designed to illustrate            bra for streaming queries. O

                           Example Queries
a hardware-based implementation for the most basic oper-
                    t ∈ {time, tuple}: time-, or tuple-based             the algebra dialect listed in T
ators 1in stream processing. Queries Q1 and Q2 field simple
    q      q2       concatenation; position-based use join               be composed in an arbitrary
projections as well as selections and compound predicates:                    Our algebra uses an “asse
   Table 1: Price, Volume
    SELECT Supported streaming algebra (a, b, c: field                    breaks down selection, arithm
   names; q, qi : sub-plans; x: parameterized sub-plan
      FROM Trades                                                        into separate algebra operat
   input). Symbol = "UBSN"                       (Q1 )            simple projections as well as selections and
                                                                         similar representation also su
     WHERE                                                        compound predicates

      INTO UBSTrades                                                     tics of XQuery [11]. In the c
                                                                         notation turns out to have n
   ofSELECTstocks (similar functionality is used, e.g., to im-
      UBS Price, Volume                                                  flow in a hardware circuit an
       FROM Trades
   plement finite-impulse response filters). Finally, Query 2 )5
                                                         (QQ             for parallel evaluation.
   determines the average trade prices for > 100000 symbol
      WHERE Symbol = "UBSN" AND Volume each stock
                                                                              Figure 1 illustrates how ou
   overINTO last ten-minutes window:
        the LargeUBSTrades                                               to express the semantics of Q
     SELECT count () AS Number
   Financial analytics often depend on statistical information           how in Figure 1(a), e.g., ope
                                                                  counts the number of trades of UBS shares over
       FROM Trades [ SIZE 600 ADVANCE 60 TIME ]
from the data stream. Using sliding-window and grouping)          the last 10 each (600 seconds) and returns the
                                                                         of minutes input tuple with the r
                                                          (Q3     aggregate every minute.
      WHERE Symbol =Q3 counts the number of trades of UBS
functionality, Query  "UBSN"                                             explicit. Its output, the new
shares INTO the last 10 minutes (600 seconds) and returns
       over NumUBSTrades                                                 to filter out non-qualifying t
the aggregate every minute. In Query Q.125 ]) AS Wprice
     SELECT wsum (Price, [ .5, .25, .125, 4 , we assume the                   The concatenate operator
       FROM (SELECT * FROM Trades
presence of an aggregation function wsum that computes the        the presence of an aggregation by position”. T
                                                                         called a “join function wsum that
weighted sum WHEREthe prices"UBSN") the last four trades)
                over Symbol = seen in                     (Q4     computes the weighted sum over the prices seen in
                                                                         are combined into a wide re
                                                                  the last four trades of UBS stocks (similar
             [ SIZE 4 ADVANCE 1 TUPLES ]                          functionality is used
       INTO WeightedUBSTrades
     SELECT Symbol, avg (Price) AS AvgPrice
       FROM Trades [ SIZE 600 ADVANCE 60 TIME ]                   the average trade prices for each stock symbol
                                                          (Q5 )   over the last ten-minutes window
      GROUP BY Symbol
       INTO PriceAverages
Algebraic Plans
              πa1 ,...,an (q )    projection
              σa (q )             select tuples where field a contains true
              ○a:(b1 ,b2 ) (q )   arithmetic/Boolean operation a = b1 b2
 llabora-     q1 ∪ q2             union
plication
              agg b:a (q )        aggregate agg using input field a ,
  market
                                  agg ∈ {avg, count, max, min, sum}
 rmation
packages      q1 grpx|c q2 ( x ) group output of q1 by field c, then
a rate at                         invoke q2 with x substituted by the group
 s rate is    q1 t x|k,l q2 ( x ) sliding window with size k , advance by l ;
5].                               apply q2 with x substituted on each wind.;
] cannot                          t ∈ {time, tuple}: time-, or tuple-based
ential fi-     q1 q2               concatenation; position-based field join = join by position
 uations,
 ion. At     Table 1: Supported streaming algebra ( a, b, c: field
 units of    names; q, qi : sub-plans; x : parameterized sub-plan
             input).
an input
out each
.2 q1 Example Queries
         t
           q2 (x) sliding window with size k, advance by l;                                                    Input to our compiler is a query representation
               x|k,l
  Our first set of example queries substituted on each wind.;
                              apply q2 with x is designed to illustrate

                                                                Algebraic Plans
                                                                                                            bra for streaming queries. Our compiler currentl
  hardware-based implementation for the most basic oper-
                              t ∈ {time, tuple}: time-, or tuple-based                                      the algebra dialect listed in Table 1, whereby ope
 tors 1in stream processing. Queries Q1 and Q2 field simple
     q         q2             concatenation; position-based use join                                        be composed in an arbitrary fashion.
 rojections as well as selections and compound predicates:                                                     Our algebra LargeUBSTradesLargeU BS Trades NumU BS Trades
                                                                                                                                        uses an “assembly-style” represent
                                                                                                                                                                  NumUBSTrades
                                                                                                                     UBSTrades             π
   Table 1: Price, Volume
     SELECT Supported streaming algebra (a, b, c: field                                                      breaks down selection, arithmetics, and predicate
                                                                                                                                        U BSPrice,Volume
                                                                                                                                              Trades
                                                                                                                                                                         time
                                                                                                                                                               πPrice,Volume
                                                                                                                                                                         x|600,60           time         Wei
                                                                                                                                                                                            x|600,60
   names; q, qi : sub-plans; x: parameterized sub-plan
          FROM Trades                                                                                       into separate algebra operators. In earlier work, w
                                                                                                                     πPrice,Volume               σc
                                                                                                                                        πPrice,Volume          Trades σc     coun t Number
   input). Symbol = "UBSN"                                                            (Q1 )                 similar representation also suited to express,countNum
                                                                                                                                             ○c:(a,b)
                                                                                                                                              ∧                                      Trades
                                                                                                                                                                                                   e.g.,
       WHERE                                                                                                              σa                                     ○c:(a,b)
                                                                                                                                                                  ∧
                                                                                                                                                                                   σa                      σa
          INTO UBSTrades                                                                                                                > σa In the
                                                                                                            tics of XQuery○b:(Volume,100000)> context of the current
                                                                                                                                         [11].                                                          σa
                                                                                                                                                           ○b:(Volume,100000)
                                                                                                                 ○a:(Symbol,"UBSN")
                                                                                                                 =                                                         ○a:(Symbol,"UBSN") ○a:(Symb
                                                                                                            notation turns a:(Symbol,"UBSN")
                                                                                                                                     ○ ○a:(Symbol,"UBSN") nice correspondences t
                                                                                                                                     = out to have
                                                                                                                                        =                                  =                      =
                                                                                                                                                                                             ○a:(Symbol,"
                                                                                                                                                                                              =
   ofSELECTstocks (similar functionality is used, e.g., to im-
       UBS Price, Volume                                                                                                                                   ○a:(Symbol,"UBSN")
                                                                                                                                                            =
                                                                                                            flow in a hardware circuit and helps detecting opp
          FROM Trades                                                                                                  Trades                 Trades                                 x                  Trad
   plement finite-impulse response filters). Finally, Query 2 )5                              Q                                             Trades                   Trades                               x
                                                                                      (Q                    for parallelQevaluation.Q2
                                                                                                                  (a) Query 1            (b) Query                     (c) Query Q3
   determines the average trade prices for > 100000 symbol
       WHERE Symbol = "UBSN" AND Volume each stock
                                                                                                               FigureBS Trades
                                                                                                                                      (a) Query Q1           (b) Query Q2
                                                                                                                              1 illustrates how our streaming algebra c
                                                                                                                                                                                          (c) Query Q3

   overINTO last ten-minutes window:
            the LargeUBSTrades                                                             LargeU BS Trades          NumU
                                                                                                            to express the semanticsAlgebraic query plans for PriceAver
                                                                                                                                           Figure 1:
                                                                                                                                                               Queries Q1 throughfive
                                                                                                                                                           of Figure 1: Algebraic query plan     the
                                                                                                                                                                                                          Q
                                                                       U BS Trades             πPrice,Volume               time
                                                                                                                                                         WeightedU BS Trades
     SELECT count () AS Number
  Financial analytics often depend on statistical information                      LargeUBSTrades how in Figure 1(a), e.g., operator ○ makes the c
                                                                                                             NumUBSTrades  x|600,60
                                                                                                                                                                               =
                                                                       πPrice,Volume                σc                       6-to-1 lookup tables                      69,120                       a time
                                                                                                                                                                                                       piece
          FROM Trades [ SIZE 600sliding-window and grouping
rom the data stream. Using                      ADVANCE 60UBSTrades
                                                                TIME ]                                                                                               tuple
                                                                                                                  Tradesinput tuple 6-to-1 lookup requested stock symbo
                                                                                        (Q3 )∧ c:(a,b) of each flip-flops (1-bit flip-flopsthe registers)
                                                                                       πPrice,Volume                                            with
                                                                                                                   time         countNumber registers)            tables
                                                                                                                                                                       69,120
                                                                                                                                                                     x|4,1
                                                                                                                                                                                           69,120 ing the
                                                                                                                                                                                         PriceAverages x|600,
                                                                                                                                                 WeightedUBSTrades
       WHERE Symbol = "UBSN"                                                                     ○                 x|600,60                                (1-bit                          69,120
unctionality, Query Q3 counts the number of trades of UBS      πPrice,Volumeσa                c             explicit. block RAM block RAM 296×18WPrice:(Price,[··· ]) is usedgra
                                                                                                           Trades
                                                                                                                                    output, the a
                                                                                                                             25×18-bit
                                                                                                                       countNumber
                                                                                                                                                            new column a, kbit RAM
                                                                                                                              Its σa multipliers σtuple wsum kbit 296×18 Trades
                                                                                                                                                                            64
                                                                                                                                                                                              time
                                                                                                                                                                                                    progra
          INTO the last 10 minutes (600 seconds) and returns
                   NumUBSTrades                                                            ○b:(Volume,100000)
                                                                                           >                                                                x|4,1
                                                                                                                                                25×18-bit multipliers (operator σ ).
                                                                                                                                                                                              x|600,60
                                                                                                                                                                                                64 progra
 hares over                                                        ○a:(Symbol,"UBSN") ○c:(a,b)
                                                                    =                    ∧                  to filter out non-qualifying rate MHz
                                                                                                                             typical clock rate
                                                                                                                             ○a:(Symbol,"UBSN")typical clock
                                                                                                                             =
                                                                                                                                                                 tuples
                                                                                                                                                                    100
                                                                                                                                                   ○a:(Symbol,"UBSN")
                                                                                                                                                   =                                    100 MHz tribute
                                                                                                                                                                                                       ax
                                                                      a                    ○a:(Symbol,"UBSN")
                                                                                            =                                                                                 x
                                                                                                                                                           wsumWPrice:(Price,[··· ])   Trades grpy|Sym
 he aggregate every minute. In Query Q.125 ]) AS ○b:(Volume,100000)
     SELECT wsum (Price, [ .5, .25, .125, 4 , we assume the                        Wprice
                                                                                   >                           The concatenate operator
                                                                                                                               a                     a
                                                                                                                    Table 2: Xilinx XC5VLX110T characteristics.
                                                                                                                                                                             represents what        the FP
          FROM (SELECT * FROM Trades                         ○a:(Symbol,"UBSN")
 resence of an aggregation function wsum that computes the Trades called a:(Symbol,"UBSN") ○a:(Symbol,"UBSN") Tuples from both inp
                                                             =            Trades
                                                                                   ○a:(Symbol,"UBSN")
                                                                                    =                               ○ a “join xby =
                                                                                                                     =                      position”. x Trades
                                                                                                                                       Table 2: Xilinx XC5VLX110T characteristics.               x A pa  avg
                                                                                                                                                                                                    of (e)
                                                                                                                                                                                                        cont
weighted sum WHEREthe prices"UBSN") the last four Q1 (Q4 ) Query Q2 they arrive.Queryoperator a necessary, for instance, to evalu- the tionalQ
                      over Symbol = seen in                          (a) Query
                                                                 Trades
                                                                                   trades    (b)
                                                                                         Trades
                                                                                                            are combined Q3
                                                                                                                        (c)                                  (d) Query Q4
                                                                                                                                The into is wide result tuple in
                                                                                                                              x they arrive.Trades
                                                                                                                                                                                                     orde    m
                   [ SIZE 4 ADVANCE 1 TUPLES ]                                                                 ate and return different aggregation functions over thefor instance, to dat
                                                                                                                                                   The operator is necessary, same
                                                                                                                                                                                                    by
                                                                                                                                                                                                          eva
                                                             (a) Query Q1            (b) Query Q2 1: Algebraic query plans return different aggregation functions overQuery
                                                                                              Figure           input stream.3 ate and for the five example queries Q1 to Q5the sa
                                                                                                                (c) Query Q                           (d) Query Q4                                   .
                                                                                                                                                                                               (e) Typica
          INTO WeightedUBSTrades                                                                                                   input stream.
                     LargeUBSTrades     NumUBSTrades                                                                                                                                                the add
    UBSTrades Symbol, avg (Price) AS AvgPrice
     SELECT                                                                                                    3. FPGAS FOR STREAM PROCESSING to
                                                                               6-to-1Figure tables PriceAverages plans for the five example queries Q1 to Q5 .the RAM c
                                                                                       lookup 1: Algebraic query                                                                                    tionalit
                        πPrice,Volume        time
                                                                    WeightedUBSTrades
                                                                                                                         69,120
                                                                                                                                   3. FPGAS FORof data, its address, flip-flop reg
                                                                                                                                                     a piece
                                                                                                                                                                STREAM PROCESSING                   constan
          FROM Tradesσ [ SIZE 600x|600,60       ADVANCE 60 TIME ] flip-flops (1-bit registers) time its69,120heart, every FPGA chip dataFPGA chipFPGAs, of three mr
                                                                                                                  At       very                      ing the consists of three main
                                                                                                                                                                       back. In
     πPrice,Volume             c                                               block RAM5 )
                                                                                tuple   (Q                     types of components. its a piece of data, distributed overthe RAM and tig
                                                                                                                 296×18 kbit          At A large heart, every
                                                                                                                                              very number of lookup tables (LUTs)chip
                                                                                                                                                     RAM are                         consists
                                                                                                                                                                                        the             We
       GROUP BY Symbol                Trades     countNumber           6-to-1 lookup tables
                                                                                x|4,1
                                                                               25×18-bit multipliers
                                                                                                                 69,120
                                                                                                                x|600,60                                           its address, to lookup tables (LU
                                                                                                                               64 types of ing the of logic large number lookup
                                                                                                                                             components. A gates. Each of
                                                                                                               provides a programmable type data back. In FPGAs,addition, registers
                                                                                                                                                     programmable logic. In
                                                                                                                                                                                                    chip, t
                                                                                                                                                                                                    tion prt
                                                                                                                                                                                                  lookup
                         ∧ c:(a,b)                                     flip-flops (1-bit registers)                69,120                                                                flip-flop
          INTO PriceAverages                                                                                                                         programmed at runtime gates. beimplem
                                                                                                                                   provides aarbitrary 6 bit → 1 bit logicand thusEach look
                                                                                                                                                 programmable type of function.                      used a
          σa                                         σa                σa     wsumWPrice:(Price,[··· ]) Trades kbit MHz
                                                                               typical clock rate
                                                                       block RAM                               table100 implement an
                                                                                                                       can
                                                                                                           296×18 grpy|Symbol                RAM are distributed over the chip and tightly w        memor
                         > b:(Volume,100000)                                25×18-bit multipliers             Lookup tables are wired through an an arbitrary such, → 1 bit tables
                                                                                                                     64
                                                                                                                               table can implement memory. As 6 bit the on-chip sto
                                                                                                                                              tributed
                                                                                                                                        programmable interconnectan interconnect netwo
                                                                                                                                                                      network         functi
                                                                                                                                                         logic. In addition, lookup The ac
    = a:(Symbol,"UBSN") these five queries in the following to demonstrate                                                      Lookup tablesthe FPGA can be accessed in a truly paral
                                                                                                                                                are wired through
        We use =                                = a:(Symbol,"UBSN") = a:(Symbol,"UBSN") Xilinx XC5VLX110T x can route that canacross the chip.runtime and thus Finally, rare up
                                                                            Table 2:
                                                                            typical clock rate
                                                                                            x                100 MHz avgAvgPrice:Price programmed at across the flip-flops used as add
                                                                                                              that              signals
                                                                                                                   characteristics.                         Finally,
                                                                                                                                          route 1-bit storage units that can
                                                                                                                                                 signals                    be
                                                                                                                                                 A particular use ofchip. potential flip-flo
                                                                                                                                                                      this             is the
                           a:(Symbol,"UBSN")                                                                  (also called registers) provide memory. As such, the on-chip storage r
                                                                                                                                        tributed
    various features as well as the compositionality of the Glacier                                                                                     provide 1-bit storage unitsthat is
                                                                                                                               (also called registers)logic.
                                                                                                                                              of content-addressable memory (CAM). c
                                                                                                              directly be wired into the remaining be accessed in a truly parallel fasthat O
         Trades               Trades                    x            Table 2: Xilinx XC5VLX110T characteristics.yto evalu-the wired into the remaining logic.
                                                                          Trades                                                            FPGA can
                                                                        they arrive. The operator is necessary, Theinstance, directly betables, the wiring of the intercon-
                                                                                                                  for behavior of lookup                                             lookup
                                                                                                                                              tional memory, content-addressable memor
    compiler. (b) Query Q                                                                                                                  A particular use of this potentialof thelatency
                                                                                                                                                                                is theinterco
                                                                        ate and return different aggregation nect, Query Q5 initial state of byof lookup tables, all be con- explicit me
                                                                                                               functions over theThe behavior data values, rather wiring
                                                                                                                                                                                        impl
    (a) Query Q1                         2   (c) Query Q3                      (d) Query Q4                      (e) and the       same        the flip-flops can the than by
                                                                        input stream.                                          nect, The content-addressable the flip-flops can all a co
                                                                                                                                        of     initial state of memory (CAM). Other t
                                                                                                              figured by software. and the Typically,content-addressable memory begiv
                                                                                                                                                                                     implem
     2.3        Algebraic Plans                                                                                                             actual configuration are typically resolve can
                                                                  they arrive. The operator is necessary, for instance, to evalu-                          CAMs is used to
                                                                                                                                        tional memory,
                                                                                                                               figured by software. Thelanguageconfiguration istwo cy
                                                                                                                                                             actual (such as          typica
                           Figure 1: Algebraic query plans for the five example queries Q1 to Q5described using a hardware descriptionrather than by explicit More ge
                                                                                                               .
                                                                  ate and return different aggregation functions over the same                 the address it has been stored at.
                                                                                                                                        byusing a hardware description language (such
                                                                                                                                            data values,                            memory
                                                                                                                               described
                                                                                                              VHDL or Verilog) and loadedtionality can implement an arbitrary key-
                                                                                                                                               into the FPGA.
FPGAs for Stream Processing
                                       notification                               algebraic                       VHDL
                                                                                 query plans                     Code         Xilinx
                        query plan                  CPU                                             Glacier
                                                                                                                            Synthesizer
   network                                                            the FPGA is directly connected to the physical network interface,
              NIC          π σ                                (a)     with parts of the network controller implemented inside the
                                                   Main               FPGA fabric.
   data                                           Memory
   stream                                                                    Figure 3: Compilation of abstract que
                        FPGA
                                       notification                           hardware circuits for the FPGA.

                        query plan                  CPU
                                                                             sub-plan adheres to the same well-defined wi
                 data      π    σ                             (b)
              stream                               Main
                                                  Memory
                                                                             4.1 Wiring Interface
                                                                      the FPGA can also be used in a traditional co-processor setup

                                                                                   As in data streaming engines, our processin
                        FPGA
                                                                                tirely push-based. Each n-bit-wide tuple is
                                                                                a set of n parallel wires in the FPGA fabric
Figure 2: System Architectures: (a) Stream engine
                                                                                wires, a new tuple can be propagated in eve
between network interface and CPU. (b) Stream
                                                                                FPGA’s system clock (i.e., 100 million tuple
engine as a coprocessor to the CPU.
                                                                                An additional data valid line signals the pres
                                                                                in a given clock cycle. Tuples are only consid
                                                                                of the data stream if their data valid flag is s
tem main memory, then informs the host CPU about the
                                                                                if the data valid line carries an electrical “hig
arrival of new data (e.g., by raising an interrupt).
                         The can also Figure 2 (a) fits pattern that is highly common inthe following, we use rectangles to repre
  Alternatively, the FPGA architecture inbe used in aa traditional
                                                                                   In data stream applications.
                                                                                ponents (with the exception of multiplexers
co-processor setup, as illustrated in Figure 2 (b). Here, the
                                                                                use the common trapezoid notation). Our
CPU hands over data to the FPGA either by writing directly
                                                                                clock-driven or synchronized and every oper
into FPGA registers (so-called slave registers) or it prepares
                                                                                brary writes its output into a flip-flop regi
Query Compilation

                          algebraic                  VHDL
                          query plans                Code        Xilinx
CPU                                       Glacier
                                                               Synthesizer
        (a)                                                                       circuit
                                                                                 on FPGA
Main
emory
                              Figure 3: Compilation of abstract query plans into
                              hardware circuits for the FPGA.
        The compiler applies compilation rules (Section 4) and op- timization heuristics
        (Section 6), then emits the description of a logic circuit that implements the input plan.


CPU
                      sub-plan adheres to the same well-defined wiring interface.
        (b)
Main                  4.1 Wiring Interface
emory
                         As in data streaming engines, our processing model is en-
                      tirely push-based. Each n-bit-wide tuple is represented as
                      a set of n parallel wires in the FPGA fabric. On a set of
ream engine
                      wires, a new tuple can be propagated in every cycle of the
 (b) Stream
                      FPGA’s system clock (i.e., 100 million tuples per second).
down the data path, as shown in R ule Proj: &               &
  rator’s leftmost observableFor
  orresponds to the output.com-
   rectangles to represent logic response time, whereas
    4: Compiled hardware execution plan for
ue we multiplexers, throughput.
     rate determines black-box
 ,ion of depict thefor which we                     q

                                      Wiring Interface
Qhardware Our circuits are all circuit is 3, issue rate 1.
a 1 . Latency of this
  d notation). implementation
     Synchronization the right.
  zed and every operator in our li-                                     πa1 ,...,an (q) ⇒           ···       .1   (Proj) 1
ery q as shown on after pro-
  nto a flip-flop register also need to be considered dur-
                                                                                                    a1 an ∗
se arrows sometimes boxes and
   properties to denote the wiring between hardware
gisters as gray-shaded                                                                                 q
   ry compilation. For instance, all compilation rules
plansthat a generated circuit willwe identifythis islines T his implementation for πa1 ,...,an has an interesting side
ents. Wherever appropriate, never try to those also the rate
  explicit as
nsure have an issue rate of one, push
tuple For that correspondan operator thattuple field
 utput.
 ples in bus                         to a specific has
   black-box plan.cycles into
           successive
omplete than one. We use two types of logic Our circuits are allOclock-driven or synchronized and every
                            q                                   effect. ur compiler emits the description of a hardware cir-
                                                                                  2
abel at less respective input/output port. The labelin ourthat iswrites its output into synthesizer to generate the actual
  e rate the
  mentation                                                     cuit library passed into a a flip-flop register
                                                         operator
ds for “all remaining fields”. We between sub- after processing. configuration for the F P G A . T he synthesizer op-
nentsright.
    the to implement the synchronization do not represent       hardware
Union
  :
ote the wiring betweentuple. The hardware plan for the
 r of fields within a hardware                                   timizes out “dangling wires”, effectively implementing pro-
                                                                                          Figure 5: Compilation r
expression σa (q)flag thus be for streams with of an algebraic for free. (shown for an instance
ppropriate, we identify those lines illustrated as
  a data flow  data      can        registers
 queues act asvalidpoint of view, the task
                    short-term buffers
                                                                jection pushdown
orrespond to a specific tuple field                          T here is no actual work to do at runtime (though fields
  varying data rate. T hey emit data at a predictable
 tive input/output port. The label
perator ∪ is notrate of an upstream sub-circuit.
ate, typicallyWe do
ning fields”.
                       to accumulate the outputare propagated into a new set ofin parallel). and
               the issue represent
                                                          of several             open registers). L atency
 treams hardware the for theoutput stream. Since,rate of this implementation for pro jection are both 1.
   tuple. The runtime, single       .
                  & as with the output rate.
                                                  σa(q) issue in our
aote that, atinto a plan average input rate must not
xceed what illustrated
can thus be is achievable                               4.3 Arithmetics and Boolean Operations
  source streams a ∗ truly in parallel,circuit, the logical ʻandʼ gate invalidates the output
                          operate                       In this a hardware
n most practical cases, the depth of the F I F O can be tuple wheneverindicated in false. 1, we use the generic ○a:(b ,b ) op-
                                                                   A s field a contains Table
  ntation for ∪ needs to ensure proper synchroniza- arithmetic computations, value compar-
                           q
ept very low.. T his not only implies a small resource          erator to represent
                                                                                                                1   2


eotprint, but also means that the impact onports usingisons or Boolean connectives in relational plans. Sub-plan
    do ∗ by buffering all input the over-
    a     so                                                      FIFOs:                      current window. T he in-
  ircuit, the typically ‘and’ gate invalidates the output
  l latency is logical small.                                   stance                        every execution and may, e
     q
henever field a contains false.                       FIFO
                  −n
  operatorsinvalidates block data items for a fixed num-
                     can the union
                 z 4-way output                                                          1
                                                                                           ○tically sound manner.
                                                                                            = a:(b ,b ) (q) ,
                                                                                             2
   ‘and’ gate                               FIFOs
ntains false.Characteristics e.g., to properly syn-
   Circuit
 er of cycles n. T his can be used,                             e.g., will emit all fields in q, Our compiler field a that
                                                                                                   ex tended by a new implements
hronize the output of slow arithmetic operators with            contains the outcome of b1 = b2a template circuit (full
bove circuittuple flow (the circuit below implements clock
 teristics
he remaining      will compute its output in a single                                         into .
 d will its output in to consume a of input tuple at T his semantics directlyure 5). We an implementation add
○a:(b ,b )be ready a that the latency new is 2):
            (q); assume single clock
                                                                                               translates into
                                                                                                               introduce an
 teconsume clock.inside the union component logic (wefor- a similar circuit a moment ago):
 ompute
     1  machine input tuple that its latency and issue then saw
         2
ck of the a new We say at
                                                                in
 to
uples1. In its the input FIFOs in more than one fashion
We say from latency circuits may need a round-robin
                                                                                              stream”) next to the data v
 both      that general, and issue
                                                                                                            a
 l, circuits may need theirthan one
                             a                       Delay operators (q) ⇒ a window closes to notify t
                       more computation can be picked up ○           = a:(b ,b )                          =            (Equals)
heir them as can beunion result.
 ts the result of −2 picked −2
       computation the
 til                                                                  1   2
                                                                                                                   .
                                  up
 perator output—they larger a latency that is larger the union fixed number of cycles n bof ∗that window. T
                     z
 theyeverylatency that is have
gh
                                z     .
        have a individual input may feed z can block data items for a
                                                         into
                                                        −n                                    elements 2
                                                                                                   b1
    Due circuits semantics, ∗
                       b1       b2
  mantics,to theirthat implement circuits that implement                                      sub-plan to start generating
                                                                                                          q
g or at an arbitrary produce output before they
ent windowing cannot they rate (i.e., issue rate 1), the
annot produce output before tuple
                           q
   of the respective query streams together must Most simple arithmetic or Boolean operations userun within
  rate of all input window.                                      not exceed                       A common will case are s
en the last to be theof the respective query window. single clock cycle. More complex tasks, such as multipli-
                  tuple number of                               a
  efine latency                                                                                ples belong to several wind
Synchronization
  &         &                     &             &                    of Though[9] ain orindividual input may feed may . require (Proj)
                                                                     cation / division, a straightforward fashion. To implement
                                                                                     π 1 ,...,a floating-point arithmetics, into the union
                                                                          state) every n (q) ⇒                       ···
  earlier how our assembly-style selection WPrice
 rice         WPrice
                                                       opera-         coun t , e.g., we use Sometimes, counter component and wire
                                                                      additional latency. a standard the actual acircuit that im-
                                                                                                                     a1 n ∗
                            WPriceneed WPriceconsidered dur- component at an arbitrary tuple rate (i.e., issue rate 1), the
  operties sometimes also circuit. Compilation
                                          to be                       plements        can be tuned within the trade-offs latency, is-

                                                    Query Units
    be cast into a hardware                                                                                             q
           wsum eosFor instance, all compilation rules average rate of all the data valid signal of the input stream.
                         wsum eos wsum eos wsum
    compilation. translation in the notation we also
                                                                     its trigger input to input streams together must not exceed
ormalizes this                                                       Once weand chip spaceper cycle, whichthe latency of(a) emit
                                                                     sue rate, reach                consumption. If                      is
      eos
ure that a of this&
remainder generated circuit will never try toto
                         work. We & the ⇒ &
                                      use
                                                                     more than one the endoperatorsπhave to is the maximum tu-
                                                                                           tuple of the current stream, we
                                                   symbol push the counter implementation for a1 ,...,an be introduced to
                                                                     greaterTthan one, delay
                                                                                 his value to the upstream data pathan interesting side
                                                                                                                           has and
          &
 s in successive cycles and, as before, assume                       synchronize the operator
                                                                           effect. to                   component the forward (b) reset
                                                                     ple rate thatur compiler emits the description of a up-stream
                                                                                        the union output with can remaininghardware cir-
 e “compiles to” relationinto an operator that has the counterO zero to prepare for the next input stream.                           fields
 rate 0 labeled q 1is the We 1usethat results from
  angleless than one. circuit two types CSR1             of logic the shownpath. In terms of 4.1.2).
                                                                     (as data that is passed into latency, the state machine inside
                                                                                   before in Section a
                                                  1                      Incuit translation rulesinglesynthesizer to generate the actual
                                                                             the                       for coun t a (q), process. The FIFOs
nts logical ‘and’ gate synchronization between sub- cy- operator requires a
he to implement the & completes within a single the
q:                                                                         hardware configuration for cycleFtoG A . T he synthesizer op-
                                                                                                              the P
  T herefore, the∗latency and the issue rate of the circuit the input, ith the rules we have either flip-flop can now or
                                                                     at timizes W implementedwires”, seen so far, implementing pro-
                                                                     Example. out “dangling using effectively we registers
 rated for σa are both 1.                                                                          eos          a
                                                                     block RAMpushdown fortrade-off), a hardware circuit. In cy-
                                                                      translate our (a resource query into add another latency
                                                                           jection    first example free.
        we        the &               σa
ere,σa& act ⇒ short-term buffers .for a1 ,...,a(Sel) discard igure Tthereillustrated the circuit to has at minimum(though fields
  eues(q)use as pro jection operator π streams with cle.coun a (q) ⇒           4, overall circuit therefore do a runtime (Count)
                                                                      F The we is no actual work that results from applying of  , latency
                                                        n to                                              counter
                                                                                                     rst
srying datatuple flow. Supportdatafieldarenaming (often Depending on theto Q uery Q1set of registers). L atency and
    from the a rate. T heyaemit for at predictable 2. compilation rules into a new . distribution, however, the
                             ∗        ∗                              our are propagated input data                        ∗
 essed usingthe issue rateqof an upstream sub-circuit. observed latency this implementation reflects tuples queue up 1.
 , typically the π operator) is a straightforward ex ten- T he hardware circuit be higher whenever the shape of
                     a                                                     issue rate of may quite literally for pro jection are both
                                                                                                                   q
eof what we present the averagea:(Symbol,"UBSN")
     that, at runtime, here. ○ input rate must not in an input FIFO. E ach of the operators can individually
                    =                  =                              the algebraic plan.
 scarding is field"UBSN" leavestuple output rate. means to Strictly speaking,signal to the data validOperations
  t the Symbol achievable the all tuples essentially
          resulting circuit ∗
  ed what a              from with the flow simply                    we forwardArithmetics and Booleanand issue sufficient
                                                                           4.3
                                                                     operate in a the eoscyclebinaryhave latency
                                                                                      single      a (i.e., union component is rates
                                                                                                                              output register
wire instance, could tuples by set ting their data
  invalidates discarded be ex-                                       to one). s Since all plan operators are applied sequentially,
                                                                     of
most the respective outputdepth of the F I F O canfurther implement the algebraic ∪ operator: generic ○a:(b1 ,b2 ) op-
         practical cases, the ports with any inputs be to implement (a) and Table the same the
   for                                          NumUBSTrades                  A indicated in feed 1, we use signal into the reset
ntothe data Tpath, as shown in R ule nature time the
   pressed using the very similar in Proj: to
                  his is
       false. Trades query plan                                      latencies add counter the circuit in F igure 4 has an overall t
                                                                                       up and to implement (b). Note
 vectors”here following column
   shown that Monet B / X            implies a small resource input of the represent arithmeticissue rate of a that coun a
    very low. T his notDonly 100 [13] uses to avoid  x|600,60              erator three. B y contrast, the computations, value compar-
                                                                                    to
                                                                     latency of a new output field without reading any particu-   pipelined
 print, but also On the that the impact on the over- constructsplanBoolean connectives slowest sub-plan. SinceT he in-
ng. the right.
   on                  means hard-            Trades                       isons or is determined by itsunion
                                                                     execution                           2-way in relational plans.
Hardware (q) ⇒ small. · · ·for Query Q4 . (Proj) input value. The operator emits no other field but the
 atencyside,      execution plan                                     lar stance
   ware is typically
      πa1 ,...,an the semantics of                   .
                                      a1 an ∗ minPrice                           q ∪q          ⇒
                                                                  maxaggregate1 (we2 handle grouping separately,. see next). For      (Union)
   q1 q2 is straightforward to
                                                                     Price                                 ∗         ∗
   implement.         can block di- items for a fixed num- the algebraic aggregates we=consider, coun t , sum, avg, m i n,
perators z −nWe simply data q                                 σa                                         ○a:(b1 ,b2 q(q) ,
                                                                                                          q1        )2
es cycles n. Topen window. e.g., to properly syn- and max, the latency is one cycle. A tuple can be applied at
 of thethe signals frombe used,
            oldest his can                            ○a:(Symbol,"UBSN")
                                                       =
his implementation for all 1in- n has an interesting side e.g., will emit all fields in q, ex tended by a new field a that
   rect                         πa ,...,a
  nize fields to a common implements operators with As we willeveryin the cycle (the issue rate is 1).
   put the output of slow arithmetic
he hardware circuit that out-                                  x     the input see clock following, however, the availability of
 t. O ur compiler emits the description theasliding- cir- contains the outcome of b1 = b2 .
                                                  of hardware
 remaining tuple Figure 6. circuitthe windowing
   put register set. flow (the With below implements a general, n-way union implementation eases the implemen-
   Q4 is passed into tuple gen-
                        A
                                                                     Holistic Aggregate directly translates intoaggregate func-
                                                                               T his semantics Functions. For some an implementation
 that is shown in           a synthesizer to generate the actual
   erated thisassume thatmeaningful if both is 2): tuples were of other functionality.
           (q); way onlyfor],the F P G Afour windows
 (b1ADVANCE 1 TUPLES
                             is the latency of input                 tation logic (we saw a similar circuit a moment ago):
  4 ,b2
 ware )configuration a logical ‘and’ gate to combine them: in the state required is not within constant bounds. They
                                  at most . T he synthesizer op-     tions,
   valid. Hence, we use
together at any point in effectively implementing pro-
zes out “dangling wires”, time. Hence, we in-
on pushdown wsum sub-plan. The window type
 copies of the for free.
                                                                     4.5 to batch (parts of) their input until the aggregate can
                                                                     need Windowing
                                                                     be computed when the end of the stream is seen. The pro-
                                                                                                                        a
         is q ⇒
  is tuple-based. −2            a
hereq1 no2 actual The &to do at runtime (though fields The concept,b2for such operators are the computation of
                          work ‘counter’ component .on (Concat)      totype example )windowing bridges the gap between stream
                                                                                ○a:(b1 of (q) ⇒
                                                                                 =                                     =           .    (Equals)
                                      −2
                      z            z
propagated tuples and sends the adv signal as
 s incoming into a new set of registers). L atency and
                              ∗                . ∗                   processing most relational-style Our weighted∗
                                                                     medians or and frequent items. semantics. The operator
                                                                                                                b1          b2 sum operation
fied by the query’s ADVANCE2clause (in this par-both 1.1
    rate of this implementation for pro q
                        b1 q
                              1
                                    b      ∗    jection are
                                                 2
                                                                     q
                                                                     wsum x|k,l q2 consumes but needs toof qits left-hand the last
                                                                             behaves similarly, the output remember only sub-plan
  DVANCE = 1 and we could simplify our circuit                       (q1 ) and slices it into ause of flip-flops is For each window,
                                                                     four input tuples. The set of windows. a good choice to
      Arithmetics and adv line).
                               q Boolean Operations
 uting data valid togate easily finishes within a 2
       Again, the ‘and’ the                                 single cycle. Most simplequantities of data. Here weof the will run within
                                                                     hold such small arithmetic or Boolean operations right-hand
                                                                        x|k,l invokes a parameterized execution can use them in
   Hence, latency and issue rate are both 1. ○a:(b ,b ) op- shiftsingle(x), mode, each occurrence of x replacedalways
    indicated in Table 1, we use the generic                         sub-plan q2 clock cycle. More complex tasks, buffer asby the
                                                                     a a register with such that the operator such                       multipli-
                                                              1 2
 lection and Projection
or to represent arithmetic computations, value compar- cation / division, or floating-point arithmetics, may require
essing in the windowing part of the plan is im-                      contains the last four input values.
   5. AUXILIARY COMPONENTS
    or Boolean connectives in relational plans. T he in-
Examples
                                             πPrice,Volume
               Price        Volume   ∗                                                           q1       x|k,l   q2 (x) ⇒

                                                                                                                n-way union
                                             σa                            adv
         &
                   a                 ∗            πPrice,Volume
                       Price   Volume    ∗                                        0               0 q1          x|k,l 0 2 (x)
                                                                                                                      q           ⇒     1            CSR2
                        a
                       =                     ○a:(Symbol,"UBSN")
                                             =                                     &                  &               &        &
                                                                                                                     n-way union
          Symbol                     ∗         σa                                adv         ∗                  ∗          ∗                     ∗
             &              "UBSN"
                       a                 ∗
                   Trades                                                         eos 0 q2        eos 0 q2           eos q2 eos
                                                                                                                             0              1
                                                                                                                                                q2
                                                                                                                                                       CSR2
                    a                                                                  &                  &                  &              &
                   =                  ○a:(Symbol,"UBSN")
                                      =
Figure 4: Compiled hardware execution plan for                                          &                  &                  &              &
            Symbol     this ∗
Query Q1 . Latency of "UBSN" circuit is 3, issue rate 1.                                          ∗                  ∗                  ∗        ∗
                                                                                       1                  1                  0              1   CSR1
                                                                                       eos       q2       eos       q2       eos       q2   eos q2
                         Trades
                                                                                             &                  &&       ∗         &
 ll sub-plans have an issue rate of one, this is also the rate
    Figure 4: Compiled hardware execution plan for                                                q1
 f the complete plan.                                       2                                                      (Wind)
    Query Q1 . Latency of this circuit is 3, issue rate 1.                           1        1          0       1    CSR1
4.4 Union                                                            Figure 5: Compilation rule for windowing operator
   From a data flow pointissue rate theone, thisan also the rate                                        ∗
    all sub-plans have an of view, of task of is algebraic             (shown for an instance with at most three windows
    of operator ∪ plan.
union the completeis to accumulate the output of several 2           open in parallel).               q1             (Wind)
 ource streams into a single output stream. Since, in our
    4.4 Union
 ase, all source streams operate truly in parallel, a hardware           Figure 5: Compilation rule for windowing operator
       From a data ∪ needs of view, proper of an algebraic
 mplementation forflow pointto ensure the tasksynchroniza-                   (shown for an instance with at most three windows
 ion. We do so by buffering all input ports using FIFOs:several       current window. Sub-plan q2 thus sees a finite input for
    union operator ∪ is to accumulate the output of                      open in parallel).
                                                                     every execution and may, e.g., use aggregation in a seman-
   source streams into a single output stream. Since, in our
                                                                     tically sound manner.
                    4-way union
   case, all source streams operate truly in parallel, a hardware
                                     FIFOs                              Our compiler implements this semantics by wrapping q2
   implementation for ∪ needs to ensure proper synchroniza-
Examples
                                      5-way union                                              aggregation fun
                                                                                               Algebraic Ag
      0               0               1                 0                  0            CSR2   braic aggregate
       &              &               &                     &               &                  of state) [9] in
                WPrice          WPrice            WPrice              WPrice          WPrice   coun t , e.g., we
      eos
               wsum   eos
                              wsum    eos
                                                 wsum   eos
                                                                    wsum    eos
                                                                                    wsum       its trigger inpu
                                                                                               Once we reach
            &               &                &                  &                 &            the counter va
                                                                                               the counter to
           1              0                  1                  1               1       CSR1      In the transl
      counter                            ∗
adv
                          &                                         σa                            coun t a (q)
                                  a                ∗
                                       a
                                      =                             ○a:(Symbol,"UBSN")
                                                                    =
                            Symbol           "UBSN"     ∗                                      we forward the
                                                                                               to implement
                                  Trades
                                                                                               input of the co
                                                                                               constructs a ne
 Figure 6: Hardware execution plan for Query Q4 .                                              lar input value
                                                                                               aggregate (we
Examples

                     q1 grpx|c q2 ( x ) ⇒                       for instan
                                                                pressed u
                                    n -way union                shown her
   eos                                                          on the rig
                             ∗          ∗         ∗     ∗
                                                                ware side
                    eos    q2        q2          q2    q2
                                                                q1 q2 is
                      addr                                      implemen
          DEMUX                                                 rect the s
          data rst                                              put fields
                     CAM                                        put regist
                   data             c        ∗                  erated thi
                                        q1            (GrpBy)
                                                                valid. Hen

Figure 7: Compilation rule to implement the ‘group
                                                                  q1   q2
by’ operator grp.


causes a noticeable effect on the overall chip space consump-
tion.                                                             Again,
if data volumes become high. In this case, the data is writ-      If no componen
 ten into (ex ternal) memory, where logic on the F P G A picks   bound (such as a

Streaming De-Multiplexing
 it up autonomously after it has received a work request from
 the host C P U .
                                                                 plan optimizer ca
                                                                 and run (part of )
                                                                 idea to the plan f
 5.3 Stream De-Multiplexing                                      our actual implem
    A ctual implementations may depend on specialized func-
 tionality that would be inefficient to express using standard
 algebra components. In our use case, algorithmic trading,            &
 input data is received as a multiplexed stream, encoded in a
                                                                                =
 compressed variant of the F I X protocol [4]. E xpressed us-          Symbol       "UB
 ing the Stream B ase syntax, the multiplex stream contains
 actual streams like                                                            Trade
  CREATE INPUT STREAM NewOrderStream (                              A s shown in th
     MsgType         byte, -- 68: new order                      plan now runs bo
     ClOrdId         int, -- unique order identifier             and finishes with
     OrdType         char, -- 1:market, 2:limit, 3:stop
                                                                    T he most appa
     Side            char, -- 1:buy, 2:sell, 3:buy minus
                                                                 tion of latency. T
     TransactTime long) -- UTC Timestamp
                                                                 of one. In additio
  CREATE INPUT STREAM OrderCancelRequestStream (                 sources, primarily
    MsgType      byte, -- 70: order cancel request               before.
    ClOrdId      int, -- unique order identifier
    OrigClOrdId int, -- previous order                           6.2 Increasin
    Side         char, -- 1:buy, 2:sell, 3:buy minus               T he elimination
    TransactTime long) -- UTC Timestamp                          cally leads to task
                                                                 hardware circuit
    We have implemented a stream de-multiplexer that in-
 terprets the MsgType field (first field in every stream) and
                                                                                     V
 dispatches the tuple to the proper plan part.
                                                                     &
Optimization Heuristics



Reducing Clock Synchronization
Increasing Parallelism
Trading Unions for Multiplexers
Group By/Windowing Unnesting
Evaluation

                                 FPGA            software (Linux 2.6)   high pack
     packets processed   100 %                                          tuples.
                                  100 %             100 %
                          80 %                                             Our res
                                          60 %                          the expec
                          60 %
                                                                        attractive
                          40 %                              36 %
                                                                        is used as
                                                                        the remain
                          20 %
                                                                        load. Thi
                           0%                                           system ca
                                 300,000 pkt/s     1,000,000 pkt/s
                                          data input rate
                                                                        8.   REL
Figure 8: Number of packages successfully processed                        The ide
for two given input loads. The hardware implemen-                       cessing da
tation is able to sustain both package rates.                           explored w
                                                                        [3]. His D
                                                                        operated
of 1 means that the circuit can process 100 million input               database s
                                                                           While e
Related Work
   Database Processing Systems as Hardware implementation

       “database machine” [3]

            tailor-made hardware for database processing

       Netezza Performance Server system [2]

            SPUs = Disk, NIC, a CPU ,and an FPGA

       SQL Chip [14]
In Avalanche, we aim at exploiting the configura- bility of FPGAs. With Glacier, we present a compiler that
compiles queries from an arbitrary workload into a tailor- made hardware circuit.

   Processing Model

       MonetDB/X100 system[13] (is resembled)

   Specialized processors for use in a database context

       GPUTeraSort[8]

       stream joins for the Cell processor[6]

Más contenido relacionado

La actualidad más candente

SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_IndexKohei KaiGai
 
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...Kohei KaiGai
 
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012Joshua Mora
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015Kohei KaiGai
 
Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介NVIDIA Japan
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdwKohei KaiGai
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_ENKohei KaiGai
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)Kohei KaiGai
 
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCINVIDIA Japan
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Anne Nicolas
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU deviceKohei KaiGai
 

La actualidad más candente (20)

SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
numPYNQ: accelerating NumPy on PYNQ
numPYNQ: accelerating NumPy on PYNQnumPYNQ: accelerating NumPy on PYNQ
numPYNQ: accelerating NumPy on PYNQ
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
 
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
 
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
 
Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI
最新の HPC 技術を生かした AI・ビッグデータインフラの東工大 TSUBAME3.0 及び産総研 ABCI
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
Debugging CUDA applications
Debugging CUDA applicationsDebugging CUDA applications
Debugging CUDA applications
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU device
 

Destacado (9)

Microblaze loader
Microblaze loaderMicroblaze loader
Microblaze loader
 
Reconf 201506
Reconf 201506Reconf 201506
Reconf 201506
 
Synthesijer hls 20150116
Synthesijer hls 20150116Synthesijer hls 20150116
Synthesijer hls 20150116
 
Das 2015
Das 2015Das 2015
Das 2015
 
Synthesijer zynq qs_20150316
Synthesijer zynq qs_20150316Synthesijer zynq qs_20150316
Synthesijer zynq qs_20150316
 
Synthesijer and Synthesijer.Scala in HLS-friends 201512
Synthesijer and Synthesijer.Scala in HLS-friends 201512Synthesijer and Synthesijer.Scala in HLS-friends 201512
Synthesijer and Synthesijer.Scala in HLS-friends 201512
 
Synthesijer jjug 201504_01
Synthesijer jjug 201504_01Synthesijer jjug 201504_01
Synthesijer jjug 201504_01
 
Synthesijer fpgax 20150201
Synthesijer fpgax 20150201Synthesijer fpgax 20150201
Synthesijer fpgax 20150201
 
Slide
SlideSlide
Slide
 

Similar a Streams on wires

Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
Inter-Process communication using pipe in FPGA based adaptive communication
Inter-Process communication using pipe in FPGA based adaptive communicationInter-Process communication using pipe in FPGA based adaptive communication
Inter-Process communication using pipe in FPGA based adaptive communicationMayur Shah
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
 
Automatically partitioning packet processing applications for pipelined archi...
Automatically partitioning packet processing applications for pipelined archi...Automatically partitioning packet processing applications for pipelined archi...
Automatically partitioning packet processing applications for pipelined archi...Ashley Carter
 
Nt1310 Unit 5 Algorithm
Nt1310 Unit 5 AlgorithmNt1310 Unit 5 Algorithm
Nt1310 Unit 5 AlgorithmAngie Lee
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkLiang Men
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...IRJET Journal
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
 
Topology Aware Resource Allocation
Topology Aware Resource AllocationTopology Aware Resource Allocation
Topology Aware Resource AllocationSujith Jay Nair
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd Iaetsd
 
Comparative Evaluation of Spark and Flink Stream Processing
Comparative Evaluation of Spark and Flink Stream Processing Comparative Evaluation of Spark and Flink Stream Processing
Comparative Evaluation of Spark and Flink Stream Processing Ehab Qadah
 
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)Yuuki Takano
 
The Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorDeepak Tomar
 
Design and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAsDesign and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAsIOSR Journals
 
Dynamic classification in silicon-based forwarding engine environments
Dynamic classification in silicon-based forwarding engine environmentsDynamic classification in silicon-based forwarding engine environments
Dynamic classification in silicon-based forwarding engine environmentsTal Lavian Ph.D.
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 

Similar a Streams on wires (20)

Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCEFPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
 
Inter-Process communication using pipe in FPGA based adaptive communication
Inter-Process communication using pipe in FPGA based adaptive communicationInter-Process communication using pipe in FPGA based adaptive communication
Inter-Process communication using pipe in FPGA based adaptive communication
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Automatically partitioning packet processing applications for pipelined archi...
Automatically partitioning packet processing applications for pipelined archi...Automatically partitioning packet processing applications for pipelined archi...
Automatically partitioning packet processing applications for pipelined archi...
 
Nt1310 Unit 5 Algorithm
Nt1310 Unit 5 AlgorithmNt1310 Unit 5 Algorithm
Nt1310 Unit 5 Algorithm
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration Framework
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP Library
 
Topology Aware Resource Allocation
Topology Aware Resource AllocationTopology Aware Resource Allocation
Topology Aware Resource Allocation
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg as
 
Intermediate Fabrics
Intermediate FabricsIntermediate Fabrics
Intermediate Fabrics
 
Comparative Evaluation of Spark and Flink Stream Processing
Comparative Evaluation of Spark and Flink Stream Processing Comparative Evaluation of Spark and Flink Stream Processing
Comparative Evaluation of Spark and Flink Stream Processing
 
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
 
The Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft Processor
 
Design and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAsDesign and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAs
 
Dynamic classification in silicon-based forwarding engine environments
Dynamic classification in silicon-based forwarding engine environmentsDynamic classification in silicon-based forwarding engine environments
Dynamic classification in silicon-based forwarding engine environments
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 

Más de Takefumi MIYOSHI

Más de Takefumi MIYOSHI (20)

ACRi_webinar_20220118_miyo
ACRi_webinar_20220118_miyoACRi_webinar_20220118_miyo
ACRi_webinar_20220118_miyo
 
DAS_202109
DAS_202109DAS_202109
DAS_202109
 
ACRiルーム1年間の活動と 新たな取り組み
ACRiルーム1年間の活動と 新たな取り組みACRiルーム1年間の活動と 新たな取り組み
ACRiルーム1年間の活動と 新たな取り組み
 
RISC-V introduction for SIG SDR in CQ 2019.07.29
RISC-V introduction for SIG SDR in CQ 2019.07.29RISC-V introduction for SIG SDR in CQ 2019.07.29
RISC-V introduction for SIG SDR in CQ 2019.07.29
 
Misc for edge_devices_with_fpga
Misc for edge_devices_with_fpgaMisc for edge_devices_with_fpga
Misc for edge_devices_with_fpga
 
Cq off 20190718
Cq off 20190718Cq off 20190718
Cq off 20190718
 
Synthesijer - HLS frineds 20190511
Synthesijer - HLS frineds 20190511Synthesijer - HLS frineds 20190511
Synthesijer - HLS frineds 20190511
 
Reconf 201901
Reconf 201901Reconf 201901
Reconf 201901
 
Hls friends 201803.key
Hls friends 201803.keyHls friends 201803.key
Hls friends 201803.key
 
Abstracts of FPGA2017 papers (Temporary Version)
Abstracts of FPGA2017 papers (Temporary Version)Abstracts of FPGA2017 papers (Temporary Version)
Abstracts of FPGA2017 papers (Temporary Version)
 
Hls friends 20161122.key
Hls friends 20161122.keyHls friends 20161122.key
Hls friends 20161122.key
 
Synthesijer.Scala (PROSYM 2015)
Synthesijer.Scala (PROSYM 2015)Synthesijer.Scala (PROSYM 2015)
Synthesijer.Scala (PROSYM 2015)
 
ICD/CPSY 201412
ICD/CPSY 201412ICD/CPSY 201412
ICD/CPSY 201412
 
Reconf_201409
Reconf_201409Reconf_201409
Reconf_201409
 
FPGAのトレンドをまとめてみた
FPGAのトレンドをまとめてみたFPGAのトレンドをまとめてみた
FPGAのトレンドをまとめてみた
 
Vyatta 201310
Vyatta 201310Vyatta 201310
Vyatta 201310
 
Fpgax 20130830
Fpgax 20130830Fpgax 20130830
Fpgax 20130830
 
Ptt391
Ptt391Ptt391
Ptt391
 
Fpgax 20130604
Fpgax 20130604Fpgax 20130604
Fpgax 20130604
 
Fpga local 20130322
Fpga local 20130322Fpga local 20130322
Fpga local 20130322
 

Último

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Último (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Streams on wires

  • 1. Streams on Wires — A Query Compiler for FPGAs Rene Mueller Jens Teubner Gustavo Alonso rene.mueller@inf.ethz.ch jens.teubner@inf.ethz.ch alonso@inf.ethz.ch Systems Group, Department of Computer Science, ETH Zurich, Switzerland ABSTRACT The Avalanche project at ETH Zurich addresses the chal- Taking advantage of many-core, heterogeneous hardware for lenging questions raised by these scenarios. In Avalanche, data processing tasks is a difficult problem. In this paper, we we study the impact of modern computer architectures on data processing. As part of the efforts around Avalanche, : consider the use of FPGAs for data stream processing as co- processors in many-core architectures. We present Glacier, we assess the potential of using FPGAs as additional cores a component library and compositional compiler that trans- in many-core machines and how they could be exploited for 2010.06.17 forms continuous queries into logic circuits by composing library components on an operator-level basis. In the pa- data processing purposes. In this paper, we report on the work we have done devel- per we consider selection, aggregation, grouping, as well as oping Glacier, a library of components and a basic compiler windowing operators, and discuss their design as modular for continuous queries implemented on top of an FPGA. elements. The ultimate goal of this line of work is to develop a hy- We also show how significant performance improvements brid data stream processing engine where an optimizer dis- can be achieved by inserting the FPGA into the system’s tributes query workloads across a set of CPUs (general- data path (e.g., between the network interface and the host purpose or specialized) and FPGA chips. In here, we focus CPU). Our experiments show that queries on the FPGA on how conventional streaming operators can be mapped to
  • 2. Abstract ABSTRACT The Ava Taking advantage of many-core, heterogeneous hardware for lenging qu data processing tasks is a difficult problem. In this paper, we we study consider the use of FPGAs for data stream processing as co- data proc processors in many-core architectures. We present Glacier, we assess a component library and compositional compiler that trans- in many-c forms continuous queries into logic circuits by composing data proce library components on an operator-level basis. In the pa- In this p per we consider selection, aggregation, grouping, as well as oping Glac windowing operators, and discuss their design as modular for contin elements. The ultim We also show how significant performance improvements brid data can be achieved by inserting the FPGA into the system’s tributes q data path (e.g., between the network interface and the host purpose or CPU). Our experiments show that queries on the FPGA on how co can process streams at more than one million tuples per circuits on second and that they can do this directly from the network, over data removing much of the overhead of transferring the data to to operate a conventional CPU. actual per system. The pap
  • 3. uples per network, over data streams; the proper system setup for the FPGA e data to Contributions to operate in combination with an external CPU; and the actual performance that can be reached with the resulting system. The paper makes the following contributions: 1. We describe Glacier, a component library and compiler rds many- for FPGA-based data stream processing. Besides clas- pportuni- sical streaming operators, Glacier includes specialized rocessing building blocks needed in the FPGA context. With the t go away operators and specialized building blocks, we show how r the I/O Glacier can be used to produce FPGA circuits that im- ow to ex- plement a wide variety of streaming queries. ow to deal host com- 2. Since FPGAs behave very differently than software, we machines provide an in-depth analysis of the complexity and per- provides formance of the resulting circuits. We discuss latency PUs) next and issue rates as the relevant metrics that need to be dors have considered by an optimizing plan generator. e machine e floating 3. Finally, we evaluate the end-to-end performance of an de larger FPGA inserted between the network and the CPU, run- be CPUs ning a query compiled with Glacier. Our results show that the FPGA can process streams at a rate beyond one million tuples per second, far more than the CPU could. ed provided Our work is organized as follows. The upcoming Sec-
  • 4. which typically indicate a turbulent market situation. At Table 1: Supported streaming a Motivating Applications the same time, latency is critical and measured in units of names; q, qi : sub-plans; x: para microseconds (µs). input). To abstract from the real application, we assume an input stream that contains a reduced set of information about each trade handled by Eurex (the actual streams are implemented of UBS stocks (similar functionality with a Swiss bank as a compressed representation of the feature-rich FIX pro- plement finite-impulse response filter tocol [4]). financial trading application receives data from a setdetermines the average trade prices Their Expressed in the syntax of StreamBase [16], the of streams with up-to-date market information from different stock exchanges. schema of our data would read: over the last ten-minutes window: CREATE INPUT STREAM Trades ( SELECT count () AS Number Seqnr int, -- sequence number FROM Trades [ SIZE 600 ADVANC Symbol string(4), -- valor symbol WHERE Symbol = "UBSN" Price int, -- stock price INTO NumUBSTrades Volume int) -- trade volume SELECT wsum (Price, [ .5, .25, .12 To keep matters simple, we look at queries that process FROM (SELECT * FROM Trades a single data stream only. In particular, we disallow the WHERE Symbol = "UBSN") use of joins. To facilitate the allocation of resources on the [ SIZE 4 ADVANCE 1 TUPLE FPGA, we restrict ourselves to queries with a predictable INTO WeightedUBSTrades space requirement. We do allow aggregation queries and SELECT Symbol, avg (Price) AS Av windowing; in fact, we particularly look at such functionality FROM Trades [ SIZE 600 ADVANC in the second half of Section 4. GROUP BY Symbol These restrictions can be lifted with techniques that are INTO PriceAverages no different to those applied in software-based systems. The We use these five queries in the fol necessary FPGA circuitry, however, would introduce addi- various features as well as the composi tional complexity and only distract from the FPGA-inherent compiler. considerations that are the main focus of this work. 2.3 Algebraic Plans 2.2 Example Queries Input to our compiler is a query rep
  • 5. 1 x|k,l 2 Input to our compiler is a Our first set of example queries substituted on each wind.; apply q2 with x is designed to illustrate bra for streaming queries. O Example Queries a hardware-based implementation for the most basic oper- t ∈ {time, tuple}: time-, or tuple-based the algebra dialect listed in T ators 1in stream processing. Queries Q1 and Q2 field simple q q2 concatenation; position-based use join be composed in an arbitrary projections as well as selections and compound predicates: Our algebra uses an “asse Table 1: Price, Volume SELECT Supported streaming algebra (a, b, c: field breaks down selection, arithm names; q, qi : sub-plans; x: parameterized sub-plan FROM Trades into separate algebra operat input). Symbol = "UBSN" (Q1 ) simple projections as well as selections and similar representation also su WHERE compound predicates INTO UBSTrades tics of XQuery [11]. In the c notation turns out to have n ofSELECTstocks (similar functionality is used, e.g., to im- UBS Price, Volume flow in a hardware circuit an FROM Trades plement finite-impulse response filters). Finally, Query 2 )5 (QQ for parallel evaluation. determines the average trade prices for > 100000 symbol WHERE Symbol = "UBSN" AND Volume each stock Figure 1 illustrates how ou overINTO last ten-minutes window: the LargeUBSTrades to express the semantics of Q SELECT count () AS Number Financial analytics often depend on statistical information how in Figure 1(a), e.g., ope counts the number of trades of UBS shares over FROM Trades [ SIZE 600 ADVANCE 60 TIME ] from the data stream. Using sliding-window and grouping) the last 10 each (600 seconds) and returns the of minutes input tuple with the r (Q3 aggregate every minute. WHERE Symbol =Q3 counts the number of trades of UBS functionality, Query "UBSN" explicit. Its output, the new shares INTO the last 10 minutes (600 seconds) and returns over NumUBSTrades to filter out non-qualifying t the aggregate every minute. In Query Q.125 ]) AS Wprice SELECT wsum (Price, [ .5, .25, .125, 4 , we assume the The concatenate operator FROM (SELECT * FROM Trades presence of an aggregation function wsum that computes the the presence of an aggregation by position”. T called a “join function wsum that weighted sum WHEREthe prices"UBSN") the last four trades) over Symbol = seen in (Q4 computes the weighted sum over the prices seen in are combined into a wide re the last four trades of UBS stocks (similar [ SIZE 4 ADVANCE 1 TUPLES ] functionality is used INTO WeightedUBSTrades SELECT Symbol, avg (Price) AS AvgPrice FROM Trades [ SIZE 600 ADVANCE 60 TIME ] the average trade prices for each stock symbol (Q5 ) over the last ten-minutes window GROUP BY Symbol INTO PriceAverages
  • 6. Algebraic Plans πa1 ,...,an (q ) projection σa (q ) select tuples where field a contains true ○a:(b1 ,b2 ) (q ) arithmetic/Boolean operation a = b1 b2 llabora- q1 ∪ q2 union plication agg b:a (q ) aggregate agg using input field a , market agg ∈ {avg, count, max, min, sum} rmation packages q1 grpx|c q2 ( x ) group output of q1 by field c, then a rate at invoke q2 with x substituted by the group s rate is q1 t x|k,l q2 ( x ) sliding window with size k , advance by l ; 5]. apply q2 with x substituted on each wind.; ] cannot t ∈ {time, tuple}: time-, or tuple-based ential fi- q1 q2 concatenation; position-based field join = join by position uations, ion. At Table 1: Supported streaming algebra ( a, b, c: field units of names; q, qi : sub-plans; x : parameterized sub-plan input). an input out each
  • 7. .2 q1 Example Queries t q2 (x) sliding window with size k, advance by l; Input to our compiler is a query representation x|k,l Our first set of example queries substituted on each wind.; apply q2 with x is designed to illustrate Algebraic Plans bra for streaming queries. Our compiler currentl hardware-based implementation for the most basic oper- t ∈ {time, tuple}: time-, or tuple-based the algebra dialect listed in Table 1, whereby ope tors 1in stream processing. Queries Q1 and Q2 field simple q q2 concatenation; position-based use join be composed in an arbitrary fashion. rojections as well as selections and compound predicates: Our algebra LargeUBSTradesLargeU BS Trades NumU BS Trades uses an “assembly-style” represent NumUBSTrades UBSTrades π Table 1: Price, Volume SELECT Supported streaming algebra (a, b, c: field breaks down selection, arithmetics, and predicate U BSPrice,Volume Trades time πPrice,Volume x|600,60 time Wei x|600,60 names; q, qi : sub-plans; x: parameterized sub-plan FROM Trades into separate algebra operators. In earlier work, w πPrice,Volume σc πPrice,Volume Trades σc coun t Number input). Symbol = "UBSN" (Q1 ) similar representation also suited to express,countNum ○c:(a,b) ∧ Trades e.g., WHERE σa ○c:(a,b) ∧ σa σa INTO UBSTrades > σa In the tics of XQuery○b:(Volume,100000)> context of the current [11]. σa ○b:(Volume,100000) ○a:(Symbol,"UBSN") = ○a:(Symbol,"UBSN") ○a:(Symb notation turns a:(Symbol,"UBSN") ○ ○a:(Symbol,"UBSN") nice correspondences t = out to have = = = ○a:(Symbol," = ofSELECTstocks (similar functionality is used, e.g., to im- UBS Price, Volume ○a:(Symbol,"UBSN") = flow in a hardware circuit and helps detecting opp FROM Trades Trades Trades x Trad plement finite-impulse response filters). Finally, Query 2 )5 Q Trades Trades x (Q for parallelQevaluation.Q2 (a) Query 1 (b) Query (c) Query Q3 determines the average trade prices for > 100000 symbol WHERE Symbol = "UBSN" AND Volume each stock FigureBS Trades (a) Query Q1 (b) Query Q2 1 illustrates how our streaming algebra c (c) Query Q3 overINTO last ten-minutes window: the LargeUBSTrades LargeU BS Trades NumU to express the semanticsAlgebraic query plans for PriceAver Figure 1: Queries Q1 throughfive of Figure 1: Algebraic query plan the Q U BS Trades πPrice,Volume time WeightedU BS Trades SELECT count () AS Number Financial analytics often depend on statistical information LargeUBSTrades how in Figure 1(a), e.g., operator ○ makes the c NumUBSTrades x|600,60 = πPrice,Volume σc 6-to-1 lookup tables 69,120 a time piece FROM Trades [ SIZE 600sliding-window and grouping rom the data stream. Using ADVANCE 60UBSTrades TIME ] tuple Tradesinput tuple 6-to-1 lookup requested stock symbo (Q3 )∧ c:(a,b) of each flip-flops (1-bit flip-flopsthe registers) πPrice,Volume with time countNumber registers) tables 69,120 x|4,1 69,120 ing the PriceAverages x|600, WeightedUBSTrades WHERE Symbol = "UBSN" ○ x|600,60 (1-bit 69,120 unctionality, Query Q3 counts the number of trades of UBS πPrice,Volumeσa c explicit. block RAM block RAM 296×18WPrice:(Price,[··· ]) is usedgra Trades output, the a 25×18-bit countNumber new column a, kbit RAM Its σa multipliers σtuple wsum kbit 296×18 Trades 64 time progra INTO the last 10 minutes (600 seconds) and returns NumUBSTrades ○b:(Volume,100000) > x|4,1 25×18-bit multipliers (operator σ ). x|600,60 64 progra hares over ○a:(Symbol,"UBSN") ○c:(a,b) = ∧ to filter out non-qualifying rate MHz typical clock rate ○a:(Symbol,"UBSN")typical clock = tuples 100 ○a:(Symbol,"UBSN") = 100 MHz tribute ax a ○a:(Symbol,"UBSN") = x wsumWPrice:(Price,[··· ]) Trades grpy|Sym he aggregate every minute. In Query Q.125 ]) AS ○b:(Volume,100000) SELECT wsum (Price, [ .5, .25, .125, 4 , we assume the Wprice > The concatenate operator a a Table 2: Xilinx XC5VLX110T characteristics. represents what the FP FROM (SELECT * FROM Trades ○a:(Symbol,"UBSN") resence of an aggregation function wsum that computes the Trades called a:(Symbol,"UBSN") ○a:(Symbol,"UBSN") Tuples from both inp = Trades ○a:(Symbol,"UBSN") = ○ a “join xby = = position”. x Trades Table 2: Xilinx XC5VLX110T characteristics. x A pa avg of (e) cont weighted sum WHEREthe prices"UBSN") the last four Q1 (Q4 ) Query Q2 they arrive.Queryoperator a necessary, for instance, to evalu- the tionalQ over Symbol = seen in (a) Query Trades trades (b) Trades are combined Q3 (c) (d) Query Q4 The into is wide result tuple in x they arrive.Trades orde m [ SIZE 4 ADVANCE 1 TUPLES ] ate and return different aggregation functions over thefor instance, to dat The operator is necessary, same by eva (a) Query Q1 (b) Query Q2 1: Algebraic query plans return different aggregation functions overQuery Figure input stream.3 ate and for the five example queries Q1 to Q5the sa (c) Query Q (d) Query Q4 . (e) Typica INTO WeightedUBSTrades input stream. LargeUBSTrades NumUBSTrades the add UBSTrades Symbol, avg (Price) AS AvgPrice SELECT 3. FPGAS FOR STREAM PROCESSING to 6-to-1Figure tables PriceAverages plans for the five example queries Q1 to Q5 .the RAM c lookup 1: Algebraic query tionalit πPrice,Volume time WeightedUBSTrades 69,120 3. FPGAS FORof data, its address, flip-flop reg a piece STREAM PROCESSING constan FROM Tradesσ [ SIZE 600x|600,60 ADVANCE 60 TIME ] flip-flops (1-bit registers) time its69,120heart, every FPGA chip dataFPGA chipFPGAs, of three mr At very ing the consists of three main back. In πPrice,Volume c block RAM5 ) tuple (Q types of components. its a piece of data, distributed overthe RAM and tig 296×18 kbit At A large heart, every very number of lookup tables (LUTs)chip RAM are consists the We GROUP BY Symbol Trades countNumber 6-to-1 lookup tables x|4,1 25×18-bit multipliers 69,120 x|600,60 its address, to lookup tables (LU 64 types of ing the of logic large number lookup components. A gates. Each of provides a programmable type data back. In FPGAs,addition, registers programmable logic. In chip, t tion prt lookup ∧ c:(a,b) flip-flops (1-bit registers) 69,120 flip-flop INTO PriceAverages programmed at runtime gates. beimplem provides aarbitrary 6 bit → 1 bit logicand thusEach look programmable type of function. used a σa σa σa wsumWPrice:(Price,[··· ]) Trades kbit MHz typical clock rate block RAM table100 implement an can 296×18 grpy|Symbol RAM are distributed over the chip and tightly w memor > b:(Volume,100000) 25×18-bit multipliers Lookup tables are wired through an an arbitrary such, → 1 bit tables 64 table can implement memory. As 6 bit the on-chip sto tributed programmable interconnectan interconnect netwo network functi logic. In addition, lookup The ac = a:(Symbol,"UBSN") these five queries in the following to demonstrate Lookup tablesthe FPGA can be accessed in a truly paral are wired through We use = = a:(Symbol,"UBSN") = a:(Symbol,"UBSN") Xilinx XC5VLX110T x can route that canacross the chip.runtime and thus Finally, rare up Table 2: typical clock rate x 100 MHz avgAvgPrice:Price programmed at across the flip-flops used as add that signals characteristics. Finally, route 1-bit storage units that can signals be A particular use ofchip. potential flip-flo this is the a:(Symbol,"UBSN") (also called registers) provide memory. As such, the on-chip storage r tributed various features as well as the compositionality of the Glacier provide 1-bit storage unitsthat is (also called registers)logic. of content-addressable memory (CAM). c directly be wired into the remaining be accessed in a truly parallel fasthat O Trades Trades x Table 2: Xilinx XC5VLX110T characteristics.yto evalu-the wired into the remaining logic. Trades FPGA can they arrive. The operator is necessary, Theinstance, directly betables, the wiring of the intercon- for behavior of lookup lookup tional memory, content-addressable memor compiler. (b) Query Q A particular use of this potentialof thelatency is theinterco ate and return different aggregation nect, Query Q5 initial state of byof lookup tables, all be con- explicit me functions over theThe behavior data values, rather wiring impl (a) Query Q1 2 (c) Query Q3 (d) Query Q4 (e) and the same the flip-flops can the than by input stream. nect, The content-addressable the flip-flops can all a co of initial state of memory (CAM). Other t figured by software. and the Typically,content-addressable memory begiv implem 2.3 Algebraic Plans actual configuration are typically resolve can they arrive. The operator is necessary, for instance, to evalu- CAMs is used to tional memory, figured by software. Thelanguageconfiguration istwo cy actual (such as typica Figure 1: Algebraic query plans for the five example queries Q1 to Q5described using a hardware descriptionrather than by explicit More ge . ate and return different aggregation functions over the same the address it has been stored at. byusing a hardware description language (such data values, memory described VHDL or Verilog) and loadedtionality can implement an arbitrary key- into the FPGA.
  • 8. FPGAs for Stream Processing notification algebraic VHDL query plans Code Xilinx query plan CPU Glacier Synthesizer network the FPGA is directly connected to the physical network interface, NIC π σ (a) with parts of the network controller implemented inside the Main FPGA fabric. data Memory stream Figure 3: Compilation of abstract que FPGA notification hardware circuits for the FPGA. query plan CPU sub-plan adheres to the same well-defined wi data π σ (b) stream Main Memory 4.1 Wiring Interface the FPGA can also be used in a traditional co-processor setup As in data streaming engines, our processin FPGA tirely push-based. Each n-bit-wide tuple is a set of n parallel wires in the FPGA fabric Figure 2: System Architectures: (a) Stream engine wires, a new tuple can be propagated in eve between network interface and CPU. (b) Stream FPGA’s system clock (i.e., 100 million tuple engine as a coprocessor to the CPU. An additional data valid line signals the pres in a given clock cycle. Tuples are only consid of the data stream if their data valid flag is s tem main memory, then informs the host CPU about the if the data valid line carries an electrical “hig arrival of new data (e.g., by raising an interrupt). The can also Figure 2 (a) fits pattern that is highly common inthe following, we use rectangles to repre Alternatively, the FPGA architecture inbe used in aa traditional In data stream applications. ponents (with the exception of multiplexers co-processor setup, as illustrated in Figure 2 (b). Here, the use the common trapezoid notation). Our CPU hands over data to the FPGA either by writing directly clock-driven or synchronized and every oper into FPGA registers (so-called slave registers) or it prepares brary writes its output into a flip-flop regi
  • 9. Query Compilation algebraic VHDL query plans Code Xilinx CPU Glacier Synthesizer (a) circuit on FPGA Main emory Figure 3: Compilation of abstract query plans into hardware circuits for the FPGA. The compiler applies compilation rules (Section 4) and op- timization heuristics (Section 6), then emits the description of a logic circuit that implements the input plan. CPU sub-plan adheres to the same well-defined wiring interface. (b) Main 4.1 Wiring Interface emory As in data streaming engines, our processing model is en- tirely push-based. Each n-bit-wide tuple is represented as a set of n parallel wires in the FPGA fabric. On a set of ream engine wires, a new tuple can be propagated in every cycle of the (b) Stream FPGA’s system clock (i.e., 100 million tuples per second).
  • 10. down the data path, as shown in R ule Proj: & & rator’s leftmost observableFor orresponds to the output.com- rectangles to represent logic response time, whereas 4: Compiled hardware execution plan for ue we multiplexers, throughput. rate determines black-box ,ion of depict thefor which we q Wiring Interface Qhardware Our circuits are all circuit is 3, issue rate 1. a 1 . Latency of this d notation). implementation Synchronization the right. zed and every operator in our li- πa1 ,...,an (q) ⇒ ··· .1 (Proj) 1 ery q as shown on after pro- nto a flip-flop register also need to be considered dur- a1 an ∗ se arrows sometimes boxes and properties to denote the wiring between hardware gisters as gray-shaded q ry compilation. For instance, all compilation rules plansthat a generated circuit willwe identifythis islines T his implementation for πa1 ,...,an has an interesting side ents. Wherever appropriate, never try to those also the rate explicit as nsure have an issue rate of one, push tuple For that correspondan operator thattuple field utput. ples in bus to a specific has black-box plan.cycles into successive omplete than one. We use two types of logic Our circuits are allOclock-driven or synchronized and every q effect. ur compiler emits the description of a hardware cir- 2 abel at less respective input/output port. The labelin ourthat iswrites its output into synthesizer to generate the actual e rate the mentation cuit library passed into a a flip-flop register operator ds for “all remaining fields”. We between sub- after processing. configuration for the F P G A . T he synthesizer op- nentsright. the to implement the synchronization do not represent hardware Union : ote the wiring betweentuple. The hardware plan for the r of fields within a hardware timizes out “dangling wires”, effectively implementing pro- Figure 5: Compilation r expression σa (q)flag thus be for streams with of an algebraic for free. (shown for an instance ppropriate, we identify those lines illustrated as a data flow data can registers queues act asvalidpoint of view, the task short-term buffers jection pushdown orrespond to a specific tuple field T here is no actual work to do at runtime (though fields varying data rate. T hey emit data at a predictable tive input/output port. The label perator ∪ is notrate of an upstream sub-circuit. ate, typicallyWe do ning fields”. to accumulate the outputare propagated into a new set ofin parallel). and the issue represent of several open registers). L atency treams hardware the for theoutput stream. Since,rate of this implementation for pro jection are both 1. tuple. The runtime, single . & as with the output rate. σa(q) issue in our aote that, atinto a plan average input rate must not xceed what illustrated can thus be is achievable 4.3 Arithmetics and Boolean Operations source streams a ∗ truly in parallel,circuit, the logical ʻandʼ gate invalidates the output operate In this a hardware n most practical cases, the depth of the F I F O can be tuple wheneverindicated in false. 1, we use the generic ○a:(b ,b ) op- A s field a contains Table ntation for ∪ needs to ensure proper synchroniza- arithmetic computations, value compar- q ept very low.. T his not only implies a small resource erator to represent 1 2 eotprint, but also means that the impact onports usingisons or Boolean connectives in relational plans. Sub-plan do ∗ by buffering all input the over- a so FIFOs: current window. T he in- ircuit, the typically ‘and’ gate invalidates the output l latency is logical small. stance every execution and may, e q henever field a contains false. FIFO −n operatorsinvalidates block data items for a fixed num- can the union z 4-way output 1 ○tically sound manner. = a:(b ,b ) (q) , 2 ‘and’ gate FIFOs ntains false.Characteristics e.g., to properly syn- Circuit er of cycles n. T his can be used, e.g., will emit all fields in q, Our compiler field a that ex tended by a new implements hronize the output of slow arithmetic operators with contains the outcome of b1 = b2a template circuit (full bove circuittuple flow (the circuit below implements clock teristics he remaining will compute its output in a single into . d will its output in to consume a of input tuple at T his semantics directlyure 5). We an implementation add ○a:(b ,b )be ready a that the latency new is 2): (q); assume single clock translates into introduce an teconsume clock.inside the union component logic (wefor- a similar circuit a moment ago): ompute 1 machine input tuple that its latency and issue then saw 2 ck of the a new We say at in to uples1. In its the input FIFOs in more than one fashion We say from latency circuits may need a round-robin stream”) next to the data v both that general, and issue a l, circuits may need theirthan one a Delay operators (q) ⇒ a window closes to notify t more computation can be picked up ○ = a:(b ,b ) = (Equals) heir them as can beunion result. ts the result of −2 picked −2 computation the til 1 2 . up perator output—they larger a latency that is larger the union fixed number of cycles n bof ∗that window. T z theyeverylatency that is have gh z . have a individual input may feed z can block data items for a into −n elements 2 b1 Due circuits semantics, ∗ b1 b2 mantics,to theirthat implement circuits that implement sub-plan to start generating q g or at an arbitrary produce output before they ent windowing cannot they rate (i.e., issue rate 1), the annot produce output before tuple q of the respective query streams together must Most simple arithmetic or Boolean operations userun within rate of all input window. not exceed A common will case are s en the last to be theof the respective query window. single clock cycle. More complex tasks, such as multipli- tuple number of a efine latency ples belong to several wind
  • 11. Synchronization & & & & of Though[9] ain orindividual input may feed may . require (Proj) cation / division, a straightforward fashion. To implement π 1 ,...,a floating-point arithmetics, into the union state) every n (q) ⇒ ··· earlier how our assembly-style selection WPrice rice WPrice opera- coun t , e.g., we use Sometimes, counter component and wire additional latency. a standard the actual acircuit that im- a1 n ∗ WPriceneed WPriceconsidered dur- component at an arbitrary tuple rate (i.e., issue rate 1), the operties sometimes also circuit. Compilation to be plements can be tuned within the trade-offs latency, is- Query Units be cast into a hardware q wsum eosFor instance, all compilation rules average rate of all the data valid signal of the input stream. wsum eos wsum eos wsum compilation. translation in the notation we also its trigger input to input streams together must not exceed ormalizes this Once weand chip spaceper cycle, whichthe latency of(a) emit sue rate, reach consumption. If is eos ure that a of this& remainder generated circuit will never try toto work. We & the ⇒ & use more than one the endoperatorsπhave to is the maximum tu- tuple of the current stream, we symbol push the counter implementation for a1 ,...,an be introduced to greaterTthan one, delay his value to the upstream data pathan interesting side has and & s in successive cycles and, as before, assume synchronize the operator effect. to component the forward (b) reset ple rate thatur compiler emits the description of a up-stream the union output with can remaininghardware cir- e “compiles to” relationinto an operator that has the counterO zero to prepare for the next input stream. fields rate 0 labeled q 1is the We 1usethat results from angleless than one. circuit two types CSR1 of logic the shownpath. In terms of 4.1.2). (as data that is passed into latency, the state machine inside before in Section a 1 Incuit translation rulesinglesynthesizer to generate the actual the for coun t a (q), process. The FIFOs nts logical ‘and’ gate synchronization between sub- cy- operator requires a he to implement the & completes within a single the q: hardware configuration for cycleFtoG A . T he synthesizer op- the P T herefore, the∗latency and the issue rate of the circuit the input, ith the rules we have either flip-flop can now or at timizes W implementedwires”, seen so far, implementing pro- Example. out “dangling using effectively we registers rated for σa are both 1. eos a block RAMpushdown fortrade-off), a hardware circuit. In cy- translate our (a resource query into add another latency jection first example free. we the & σa ere,σa& act ⇒ short-term buffers .for a1 ,...,a(Sel) discard igure Tthereillustrated the circuit to has at minimum(though fields eues(q)use as pro jection operator π streams with cle.coun a (q) ⇒ 4, overall circuit therefore do a runtime (Count) F The we is no actual work that results from applying of , latency n to counter rst srying datatuple flow. Supportdatafieldarenaming (often Depending on theto Q uery Q1set of registers). L atency and from the a rate. T heyaemit for at predictable 2. compilation rules into a new . distribution, however, the ∗ ∗ our are propagated input data ∗ essed usingthe issue rateqof an upstream sub-circuit. observed latency this implementation reflects tuples queue up 1. , typically the π operator) is a straightforward ex ten- T he hardware circuit be higher whenever the shape of a issue rate of may quite literally for pro jection are both q eof what we present the averagea:(Symbol,"UBSN") that, at runtime, here. ○ input rate must not in an input FIFO. E ach of the operators can individually = = the algebraic plan. scarding is field"UBSN" leavestuple output rate. means to Strictly speaking,signal to the data validOperations t the Symbol achievable the all tuples essentially resulting circuit ∗ ed what a from with the flow simply we forwardArithmetics and Booleanand issue sufficient 4.3 operate in a the eoscyclebinaryhave latency single a (i.e., union component is rates output register wire instance, could tuples by set ting their data invalidates discarded be ex- to one). s Since all plan operators are applied sequentially, of most the respective outputdepth of the F I F O canfurther implement the algebraic ∪ operator: generic ○a:(b1 ,b2 ) op- practical cases, the ports with any inputs be to implement (a) and Table the same the for NumUBSTrades A indicated in feed 1, we use signal into the reset ntothe data Tpath, as shown in R ule nature time the pressed using the very similar in Proj: to his is false. Trades query plan latencies add counter the circuit in F igure 4 has an overall t up and to implement (b). Note vectors”here following column shown that Monet B / X implies a small resource input of the represent arithmeticissue rate of a that coun a very low. T his notDonly 100 [13] uses to avoid x|600,60 erator three. B y contrast, the computations, value compar- to latency of a new output field without reading any particu- pipelined print, but also On the that the impact on the over- constructsplanBoolean connectives slowest sub-plan. SinceT he in- ng. the right. on means hard- Trades isons or is determined by itsunion execution 2-way in relational plans. Hardware (q) ⇒ small. · · ·for Query Q4 . (Proj) input value. The operator emits no other field but the atencyside, execution plan lar stance ware is typically πa1 ,...,an the semantics of . a1 an ∗ minPrice q ∪q ⇒ maxaggregate1 (we2 handle grouping separately,. see next). For (Union) q1 q2 is straightforward to Price ∗ ∗ implement. can block di- items for a fixed num- the algebraic aggregates we=consider, coun t , sum, avg, m i n, perators z −nWe simply data q σa ○a:(b1 ,b2 q(q) , q1 )2 es cycles n. Topen window. e.g., to properly syn- and max, the latency is one cycle. A tuple can be applied at of thethe signals frombe used, oldest his can ○a:(Symbol,"UBSN") = his implementation for all 1in- n has an interesting side e.g., will emit all fields in q, ex tended by a new field a that rect πa ,...,a nize fields to a common implements operators with As we willeveryin the cycle (the issue rate is 1). put the output of slow arithmetic he hardware circuit that out- x the input see clock following, however, the availability of t. O ur compiler emits the description theasliding- cir- contains the outcome of b1 = b2 . of hardware remaining tuple Figure 6. circuitthe windowing put register set. flow (the With below implements a general, n-way union implementation eases the implemen- Q4 is passed into tuple gen- A Holistic Aggregate directly translates intoaggregate func- T his semantics Functions. For some an implementation that is shown in a synthesizer to generate the actual erated thisassume thatmeaningful if both is 2): tuples were of other functionality. (q); way onlyfor],the F P G Afour windows (b1ADVANCE 1 TUPLES is the latency of input tation logic (we saw a similar circuit a moment ago): 4 ,b2 ware )configuration a logical ‘and’ gate to combine them: in the state required is not within constant bounds. They at most . T he synthesizer op- tions, valid. Hence, we use together at any point in effectively implementing pro- zes out “dangling wires”, time. Hence, we in- on pushdown wsum sub-plan. The window type copies of the for free. 4.5 to batch (parts of) their input until the aggregate can need Windowing be computed when the end of the stream is seen. The pro- a is q ⇒ is tuple-based. −2 a hereq1 no2 actual The &to do at runtime (though fields The concept,b2for such operators are the computation of work ‘counter’ component .on (Concat) totype example )windowing bridges the gap between stream ○a:(b1 of (q) ⇒ = = . (Equals) −2 z z propagated tuples and sends the adv signal as s incoming into a new set of registers). L atency and ∗ . ∗ processing most relational-style Our weighted∗ medians or and frequent items. semantics. The operator b1 b2 sum operation fied by the query’s ADVANCE2clause (in this par-both 1.1 rate of this implementation for pro q b1 q 1 b ∗ jection are 2 q wsum x|k,l q2 consumes but needs toof qits left-hand the last behaves similarly, the output remember only sub-plan DVANCE = 1 and we could simplify our circuit (q1 ) and slices it into ause of flip-flops is For each window, four input tuples. The set of windows. a good choice to Arithmetics and adv line). q Boolean Operations uting data valid togate easily finishes within a 2 Again, the ‘and’ the single cycle. Most simplequantities of data. Here weof the will run within hold such small arithmetic or Boolean operations right-hand x|k,l invokes a parameterized execution can use them in Hence, latency and issue rate are both 1. ○a:(b ,b ) op- shiftsingle(x), mode, each occurrence of x replacedalways indicated in Table 1, we use the generic sub-plan q2 clock cycle. More complex tasks, buffer asby the a a register with such that the operator such multipli- 1 2 lection and Projection or to represent arithmetic computations, value compar- cation / division, or floating-point arithmetics, may require essing in the windowing part of the plan is im- contains the last four input values. 5. AUXILIARY COMPONENTS or Boolean connectives in relational plans. T he in-
  • 12. Examples πPrice,Volume Price Volume ∗ q1 x|k,l q2 (x) ⇒ n-way union σa adv & a ∗ πPrice,Volume Price Volume ∗ 0 0 q1 x|k,l 0 2 (x) q ⇒ 1 CSR2 a = ○a:(Symbol,"UBSN") = & & & & n-way union Symbol ∗ σa adv ∗ ∗ ∗ ∗ & "UBSN" a ∗ Trades eos 0 q2 eos 0 q2 eos q2 eos 0 1 q2 CSR2 a & & & & = ○a:(Symbol,"UBSN") = Figure 4: Compiled hardware execution plan for & & & & Symbol this ∗ Query Q1 . Latency of "UBSN" circuit is 3, issue rate 1. ∗ ∗ ∗ ∗ 1 1 0 1 CSR1 eos q2 eos q2 eos q2 eos q2 Trades & && ∗ & ll sub-plans have an issue rate of one, this is also the rate Figure 4: Compiled hardware execution plan for q1 f the complete plan. 2 (Wind) Query Q1 . Latency of this circuit is 3, issue rate 1. 1 1 0 1 CSR1 4.4 Union Figure 5: Compilation rule for windowing operator From a data flow pointissue rate theone, thisan also the rate ∗ all sub-plans have an of view, of task of is algebraic (shown for an instance with at most three windows of operator ∪ plan. union the completeis to accumulate the output of several 2 open in parallel). q1 (Wind) ource streams into a single output stream. Since, in our 4.4 Union ase, all source streams operate truly in parallel, a hardware Figure 5: Compilation rule for windowing operator From a data ∪ needs of view, proper of an algebraic mplementation forflow pointto ensure the tasksynchroniza- (shown for an instance with at most three windows ion. We do so by buffering all input ports using FIFOs:several current window. Sub-plan q2 thus sees a finite input for union operator ∪ is to accumulate the output of open in parallel). every execution and may, e.g., use aggregation in a seman- source streams into a single output stream. Since, in our tically sound manner. 4-way union case, all source streams operate truly in parallel, a hardware FIFOs Our compiler implements this semantics by wrapping q2 implementation for ∪ needs to ensure proper synchroniza-
  • 13. Examples 5-way union aggregation fun Algebraic Ag 0 0 1 0 0 CSR2 braic aggregate & & & & & of state) [9] in WPrice WPrice WPrice WPrice WPrice coun t , e.g., we eos wsum eos wsum eos wsum eos wsum eos wsum its trigger inpu Once we reach & & & & & the counter va the counter to 1 0 1 1 1 CSR1 In the transl counter ∗ adv & σa coun t a (q) a ∗ a = ○a:(Symbol,"UBSN") = Symbol "UBSN" ∗ we forward the to implement Trades input of the co constructs a ne Figure 6: Hardware execution plan for Query Q4 . lar input value aggregate (we
  • 14. Examples q1 grpx|c q2 ( x ) ⇒ for instan pressed u n -way union shown her eos on the rig ∗ ∗ ∗ ∗ ware side eos q2 q2 q2 q2 q1 q2 is addr implemen DEMUX rect the s data rst put fields CAM put regist data c ∗ erated thi q1 (GrpBy) valid. Hen Figure 7: Compilation rule to implement the ‘group q1 q2 by’ operator grp. causes a noticeable effect on the overall chip space consump- tion. Again,
  • 15. if data volumes become high. In this case, the data is writ- If no componen ten into (ex ternal) memory, where logic on the F P G A picks bound (such as a Streaming De-Multiplexing it up autonomously after it has received a work request from the host C P U . plan optimizer ca and run (part of ) idea to the plan f 5.3 Stream De-Multiplexing our actual implem A ctual implementations may depend on specialized func- tionality that would be inefficient to express using standard algebra components. In our use case, algorithmic trading, & input data is received as a multiplexed stream, encoded in a = compressed variant of the F I X protocol [4]. E xpressed us- Symbol "UB ing the Stream B ase syntax, the multiplex stream contains actual streams like Trade CREATE INPUT STREAM NewOrderStream ( A s shown in th MsgType byte, -- 68: new order plan now runs bo ClOrdId int, -- unique order identifier and finishes with OrdType char, -- 1:market, 2:limit, 3:stop T he most appa Side char, -- 1:buy, 2:sell, 3:buy minus tion of latency. T TransactTime long) -- UTC Timestamp of one. In additio CREATE INPUT STREAM OrderCancelRequestStream ( sources, primarily MsgType byte, -- 70: order cancel request before. ClOrdId int, -- unique order identifier OrigClOrdId int, -- previous order 6.2 Increasin Side char, -- 1:buy, 2:sell, 3:buy minus T he elimination TransactTime long) -- UTC Timestamp cally leads to task hardware circuit We have implemented a stream de-multiplexer that in- terprets the MsgType field (first field in every stream) and V dispatches the tuple to the proper plan part. &
  • 16. Optimization Heuristics Reducing Clock Synchronization Increasing Parallelism Trading Unions for Multiplexers Group By/Windowing Unnesting
  • 17. Evaluation FPGA software (Linux 2.6) high pack packets processed 100 % tuples. 100 % 100 % 80 % Our res 60 % the expec 60 % attractive 40 % 36 % is used as the remain 20 % load. Thi 0% system ca 300,000 pkt/s 1,000,000 pkt/s data input rate 8. REL Figure 8: Number of packages successfully processed The ide for two given input loads. The hardware implemen- cessing da tation is able to sustain both package rates. explored w [3]. His D operated of 1 means that the circuit can process 100 million input database s While e
  • 18. Related Work Database Processing Systems as Hardware implementation “database machine” [3] tailor-made hardware for database processing Netezza Performance Server system [2] SPUs = Disk, NIC, a CPU ,and an FPGA SQL Chip [14] In Avalanche, we aim at exploiting the configura- bility of FPGAs. With Glacier, we present a compiler that compiles queries from an arbitrary workload into a tailor- made hardware circuit. Processing Model MonetDB/X100 system[13] (is resembled) Specialized processors for use in a database context GPUTeraSort[8] stream joins for the Cell processor[6]

Notas del editor