48. The real dilemma: business wants near realtime, but without penalties or data loss, with endless scalability, zero latency, and 100% consistency
50. Data that’s not immediately turned into useful information, and thus value, is only of archaeological, accounting, compliance, or algorithm-training interest
51. The true market advantage of the future depends on how close to realtime you are when gaining useful information out of your live data
62. But Big Data people need to learn from them
63. Go with message/event orientation, VMs with native support for it, or similar approaches, on platforms where you probably didn’t think it was possible
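A minimal sketch of such a message/event-oriented core, assuming Go (the deck names no language): producers emit typed events onto a channel and a single dispatch loop routes them to registered handlers. The Event type and the "tick" handler are illustrative, not from the deck.

```go
// Minimal event-oriented core: producers push typed events onto a
// buffered channel; one dispatch loop routes them to handlers.
package main

import "fmt"

type Event struct {
	Kind    string
	Payload []byte
}

func main() {
	events := make(chan Event, 1024) // buffered so producers rarely block
	handlers := map[string]func(Event){
		"tick": func(e Event) { fmt.Printf("tick: %s\n", e.Payload) },
	}

	go func() { // a producer
		events <- Event{Kind: "tick", Payload: []byte("42")}
		close(events)
	}()

	for e := range events { // the dispatch loop
		if h, ok := handlers[e.Kind]; ok {
			h(e)
		}
	}
}
```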
66. result = <some optimized binary that ideally fits into a single MTU of the underlying protocol(s)>
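As one hedged illustration of such a result, a Go sketch that packs a result into a fixed binary layout of 18 bytes, far below a typical 1500-byte Ethernet MTU; the Result fields are assumptions for illustration, not the deck’s actual format.

```go
// Pack a result into a fixed binary layout small enough to fit one
// typical 1500-byte Ethernet MTU (minus IP/UDP headers).
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

type Result struct {
	ID    uint64
	Value float64
	Flags uint16
}

func encode(r Result) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.BigEndian, r); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, _ := encode(Result{ID: 1, Value: 3.14, Flags: 0x01})
	fmt.Println(len(b), "bytes") // 18 bytes: far below one MTU
}
```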
67. OK, you can process and queue results for whoever listens to them (semi-time-critical, lower-level queue). Now how do you store a lot of data like this really fast?
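One way such a lower-level queue can look, sketched in Go under assumed names: the producer fans encoded results out to per-listener buffered channels, so one slow listener doesn’t stall the rest.

```go
// Fan out results to per-listener buffered queues.
package main

import (
	"fmt"
	"sync"
)

func main() {
	listeners := []chan []byte{
		make(chan []byte, 256),
		make(chan []byte, 256),
	}

	var wg sync.WaitGroup
	for i, l := range listeners {
		wg.Add(1)
		go func(id int, in <-chan []byte) {
			defer wg.Done()
			for msg := range in {
				fmt.Printf("listener %d got %q\n", id, msg)
			}
		}(i, l)
	}

	for _, msg := range [][]byte{[]byte("r1"), []byte("r2")} {
		for _, l := range listeners { // fan out to every listener
			l <- msg
		}
	}
	for _, l := range listeners {
		close(l)
	}
	wg.Wait()
}
```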
68. There is no such thing as a high-performance, high-load-capable, high-scale, multi-purpose, rich-model, absolutely reliable, and 100% consistent database
70. Classic databases and even NoSQL data stores, for different reasons, sometimes lose their original intention and focus
71. The NewSQL world aims to solve the scale-up limitations of RDBMSs through distribution while still guaranteeing ACID-ish transactions
74. You can be real fast just spilling data block-wise to the disk through DMA, but beware of caches
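A Linux-only sketch of the idea, assuming Go and O_DIRECT: the flag bypasses the page cache (the "beware of caches" part) and lets the kernel move the block via DMA, but it demands buffers aligned to the device’s logical block size.

```go
// Linux-only: write block-aligned buffers with O_DIRECT so data skips
// the page cache. O_DIRECT requires buffer address, length, and file
// offset aligned to the logical block size (512 or 4096, device-dependent).
package main

import (
	"os"
	"syscall"
	"unsafe"
)

const blockSize = 4096

// alignedBlock returns a blockSize-aligned []byte of length blockSize.
func alignedBlock() []byte {
	raw := make([]byte, blockSize*2)
	off := blockSize - int(uintptr(unsafe.Pointer(&raw[0]))%blockSize)
	return raw[off : off+blockSize]
}

func main() {
	fd, err := syscall.Open("/tmp/spill.dat",
		syscall.O_WRONLY|syscall.O_CREAT|syscall.O_DIRECT, 0644)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fd)

	buf := alignedBlock()
	copy(buf, []byte("block-wise payload"))
	if _, err := syscall.Write(fd, buf); err != nil {
		panic(err)
	}
	_ = os.Remove("/tmp/spill.dat")
}
```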
75. You’ll be a bit slower with an in-memory, journaling K/V store, but beware of weak storage reliability
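A minimal sketch of such a store, assuming Go: a map serves reads, and every write is also appended to a journal so the map can be rebuilt after a crash (replay omitted here). The "weak storage reliability" trade-off lives in the fsync policy: syncing every write is safe but slow, batching is fast but can lose the tail of the journal.

```go
// In-memory K/V store with an append-only journal.
package main

import (
	"fmt"
	"os"
	"sync"
)

type Store struct {
	mu      sync.Mutex
	data    map[string]string
	journal *os.File
}

func Open(path string) (*Store, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return nil, err
	}
	return &Store{data: make(map[string]string), journal: f}, nil
}

func (s *Store) Put(k, v string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, err := fmt.Fprintf(s.journal, "%s=%s\n", k, v); err != nil {
		return err
	}
	s.data[k] = v
	return s.journal.Sync() // the slow-but-safe fsync choice
}

func (s *Store) Get(k string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[k]
	return v, ok
}

func main() {
	s, _ := Open("/tmp/kv.journal")
	_ = s.Put("answer", "42")
	v, _ := s.Get("answer")
	fmt.Println(v)
}
```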
76. You will be slower, but you win reliability (and redundancy if you wish) when you go with a column-oriented or K/V store that is natively distributed and masterless, and as model-agnostic as possible
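A toy of the masterless idea behind Dynamo-style stores, sketched in Go and not modeled on any specific product: a write goes to all N replicas and succeeds once W of them acknowledge, so there is no single master to coordinate or to fail. Real systems add hinted handoff, read repair, and more.

```go
// Masterless quorum write: succeed once w of n replicas acknowledge.
package main

import "fmt"

type Replica interface {
	Put(key, value string) error
}

type memReplica map[string]string

func (m memReplica) Put(k, v string) error { m[k] = v; return nil }

func quorumPut(replicas []Replica, w int, k, v string) error {
	acks := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r Replica) { acks <- r.Put(k, v) }(r)
	}
	ok := 0
	for range replicas {
		if err := <-acks; err == nil {
			ok++
			if ok >= w {
				return nil
			}
		}
	}
	return fmt.Errorf("quorum not reached: %d/%d", ok, w)
}

func main() {
	rs := []Replica{memReplica{}, memReplica{}, memReplica{}}
	fmt.Println(quorumPut(rs, 2, "k", "v")) // <nil>
}
```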
77. But you need to be aware that to make such a store real fast, you’ll have to turn a lot of infrastructure knobs before your data even hits the store
78. OK, now it’s in the store, though you probably didn’t need to store it. But what if you run into the (slow) batch? How do you make it faster?
79. Go with native extensions, close to the machine and the system, instead of general portability
80. Keep it all in memory. The memory of a distributed system is also distributed
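A sketch of that, assuming hash-based sharding in Go: keys are hashed onto the in-memory maps of several nodes, so the cluster’s aggregate RAM behaves like one big store. The FNV-hash-modulo placement is an illustrative choice.

```go
// Shard keys by hash across the in-memory maps of several nodes.
package main

import (
	"fmt"
	"hash/fnv"
)

type node struct{ mem map[string][]byte }

func shard(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % n
}

func main() {
	nodes := []node{
		{mem: map[string][]byte{}},
		{mem: map[string][]byte{}},
		{mem: map[string][]byte{}},
	}
	put := func(k string, v []byte) { nodes[shard(k, len(nodes))].mem[k] = v }
	get := func(k string) []byte { return nodes[shard(k, len(nodes))].mem[k] }

	put("user:42", []byte("hot data"))
	fmt.Printf("%s\n", get("user:42"))
}
```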
81. Splice your pipes, or go with almost-zero-infrastructure queues if you mix technologies
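A sketch of pipe plumbing between mixed technologies, with placeholder commands: a kernel pipe connects a producer process straight to a consumer process, no broker in between. On Linux, Go’s io.Copy between pipes and sockets can also take the splice(2) zero-copy path.

```go
// Pipe one process straight into another: an almost-zero-infrastructure
// queue. The commands are placeholders for whatever tools you mix.
package main

import (
	"os"
	"os/exec"
)

func main() {
	producer := exec.Command("cat", "/etc/hostname") // placeholder producer
	consumer := exec.Command("tr", "a-z", "A-Z")     // placeholder consumer

	pipe, err := producer.StdoutPipe()
	if err != nil {
		panic(err)
	}
	consumer.Stdin = pipe
	consumer.Stdout = os.Stdout

	if err := producer.Start(); err != nil {
		panic(err)
	}
	if err := consumer.Start(); err != nil {
		panic(err)
	}
	producer.Wait()
	consumer.Wait()
}
```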
82. Have the data where you process it; don’t move it there first
87. But you’re slow if you don’t give them data as it comes
88. And what about Big Data Clouds, or Cloud in general?
89. Clouds can be fast, real fast. If you can afford it
90. And you’re slow if you don’t give them data as it comes
91. There is no single tool around that will do your Big Data
92. Everything that makes you faster is your friend: from hardware through kernel tweaks and network optimization to direct memory access and minimal-abstraction code
93. When you don’t need to retrieve or search, you win
94. It’s all about speed. Size doesn’t matter a lot