The document compares and contrasts the characteristics of Wile E. Coyote and Road Runner from the Looney Tunes cartoons. It describes Coyote as slow, offline, having a wide field of vision, long memory, being proactive, thorough, and always losing to Road Runner. Road Runner is described as fast, always running, having a narrow field of vision, short memory, being reactive, spontaneous, and always winning against Coyote. The document then discusses the differences between batch and near real-time analytics and technologies.
3. Wile E. Coyote
✦ pretty slow
✦ running on own demand
✦ very wide field of vision
✦ very long memory
✦ purely proactive
✦ thoroughly analysing and preparing
✦ always loses
4. Road Runner
✦ hell fast
✦ ever running
✦ very narrow field of vision
✦ very short memory
✦ purely reactive
✦ forced to immediately decide
✦ always wins
5. Coyote: slow
✦ too much mumbo-jumbo, too many tools, totally dependent on ACME
✦ needs a complex, partially distributed setup
✦ complex decisions, depending on Runner, weather, environment etc.
6. Runner: fast
✦ zero hoo-ha, zero tools, just his own body
✦ road-bound
✦ simple decisions like run | halt | step aside | beep beep
8. Runner: non-stop
✦ never stops fully, just occasionally halts for food and to fool Coyote
✦ continuously runs the road in search of food
9. Coyote: wide vision
✦ sees the whole environment
✦ tries to use the whole environment to catch Runner, predicting his paths
10. Runner: narrow vision
✦ only sees what’s in front of his nose on the road
✦ thanks to his speed and short-term predictions, copes well with the narrow, momentary vision
11. Coyote: long memory
✦ as far as possible, learns from previous failures
✦ continuously improves tricks to catch Runner
16. Runner: spontaneous
✦ decides immediately and spontaneously, depending on what Coyote does
✦ makes the best immediate decision to achieve the highest level of Coyote fooling
17. Coyote: loses
✦ no matter how hard he tries, he’s never fast or savvy enough to catch Runner
✦ never gives up though
18. Runner: wins
✦ doesn’t even try to win, but always does, thanks to speed and immediate situation analysis followed by reaction. Also due to Coyote’s continuous failure
✦ has fun fooling Coyote every time
20. Batch (analytics)
✦ is when you have plenty of time for analysis
✦ is when you explore patterns and models in historic data
✦ is when you try to fit any sort of data into a hypothetical model
✦ is when you plan and forecast the future instead of (re)acting immediately
21. Batch (architecture)
✦ is when you (synchronously) query previously stored data
✦ is when you use main memory primarily for temporary caches
✦ is when you do ETL and the like, even on Hadoop’s rails
✦ is when you split large amounts of historic data into smaller portions for distributed / parallel analysis
22. Batch (technology)
✦ is when you build on (R)DBMSs or (soft-schema) NoSQL data stores in a classic way
✦ is when you store in HDFS and process with Hadoop & Co.
✦ is when you generally rely on disks / storage
23. Near realtime (analytics)
✦ is when you don’t have time
✦ is when you analyse data as it comes
✦ is when you already have a fixed model, and the data flying in fits it 100%
✦ is when you (re)act immediately, based on patterns you learned online and in the batch analysis
24. Near realtime (architecture)
✦ is when you don’t query data, but expect / assume it
✦ is when you use main memory as the primary data storage
✦ is when you process event streams
✦ is when you distribute and parallelise only independent computations (it’s hairy enough even on one machine - explicit loop tiling, skewing etc.)
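The “process event streams, don’t query” idea can be sketched in a few lines: events are pushed into a handler and all state lives in main memory, so nothing is ever pulled from a store. The names here are purely illustrative, not from any specific framework:

```python
class StreamProcessor:
    """Keeps per-key running counts in main memory; no data store is queried."""
    def __init__(self):
        self.counts = {}

    def on_event(self, event):
        # events are pushed in; the processor never pulls / queries
        key = event["key"]
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

proc = StreamProcessor()
stream = [{"key": "beep"}, {"key": "beep"}, {"key": "meep"}]
results = [proc.on_event(e) for e in stream]
print(results)  # [1, 2, 1]
```

The point is the inversion of control: the data arrives at the computation, instead of the computation going to the data.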
25. Near realtime (technology)
✦ is when you build on DSMSs, event processing systems and the like
✦ is when you store (almost) only for archiving reasons
✦ is when you don’t hit disks or speak of “storage”
✦ is when you do your best to avoid horizontal network gossip
✦ is when you must go for accelerators such as GPUs in case of complex math
26. Near realtime - non-stop, immediate analytics cannot be done as / in batch.
27. Near realtime is tricky
✦ you need to build event-driven, non-blocking, lock-free, reactive programs (buzzword award!)
✦ you need to work time-bound, penalising or compensating late events
✦ you need to keep everything (sliced, auto-expiring) in main memory
✦ you need to fully utilise the resources of one single machine (speaking of mechanical sympathy), without waste
✦ you need to fix your model and work with fixed-size (binary) events
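The “sliced, auto-expiring main memory” point can be sketched as a time-based window that silently drops events older than its horizon. This is a toy illustration, not any specific product’s API:

```python
from collections import deque

class ExpiringWindow:
    """Keeps only events newer than `horizon` time units; older ones auto-expire."""
    def __init__(self, horizon):
        self.horizon = horizon
        self.events = deque()  # (timestamp, value); timestamps arrive in order

    def add(self, ts, value):
        self.events.append((ts, value))
        self._expire(ts)

    def _expire(self, now):
        # drop everything that fell out of the window
        while self.events and self.events[0][0] <= now - self.horizon:
            self.events.popleft()

    def values(self):
        return [v for _, v in self.events]

w = ExpiringWindow(horizon=10)
w.add(1, "a")
w.add(5, "b")
w.add(12, "c")        # ts=1 is now older than the horizon and expires
print(w.values())     # ['b', 'c']
```

Expiry here piggybacks on every insert; a real system would also expire on a timer, so the window shrinks even when no new events arrive.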
28. Scaling near realtime
✦ scaling near realtime analytics is pretty hard. The challenges are similar whether parallelising on one machine or scaling out in a distributed way
✦ you scale through logical or physical stream splitting, online scatter-gather and the like
✦ you keep distributed / parallel computation independent, until you have to merge in the next processing stage. And so on.
✦ you scale through receive-and-forward, fire-and-forget, cascading, pipelining, multicast, redundant (who’s first, role-based etc.) processing
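Logical stream splitting plus scatter-gather can be sketched as hash-partitioning events by key, so each partition is processed independently until a later merge stage (an illustrative toy, not any particular framework’s API):

```python
def split_stream(events, n_partitions):
    """Hash-partition events by key; each partition can be processed independently."""
    partitions = [[] for _ in range(n_partitions)]
    for event in events:
        partitions[hash(event["key"]) % n_partitions].append(event)
    return partitions

def count_partition(partition):
    """Independent per-partition computation (the 'scatter' stage)."""
    counts = {}
    for event in partition:
        counts[event["key"]] = counts.get(event["key"], 0) + 1
    return counts

def merge(partials):
    """The 'gather' stage: combine independent partial results."""
    total = {}
    for part in partials:
        for k, v in part.items():
            total[k] = total.get(k, 0) + v
    return total

events = [{"key": k} for k in ["a", "b", "a", "c", "b", "a"]]
partials = [count_partition(p) for p in split_stream(events, 3)]
print(sorted(merge(partials).items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

Because all events for one key land in one partition, the partial computations never need to coordinate - exactly the “keep computation independent until the merge” point above.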
29. Surviving near realtime
✦ building a restlessly event-oriented, in-memory analytics system brings some challenges
✦ disaster recovery: yet again, splitting streams (for storage), redundant (role-based) computation
✦ short-term failure recovery: up-front temporary, auto-expiring storage, auto-replay or penalising events
30. Near realtime is limited
✦ you need to run most analytics on event windows of some size
✦ you switch from exact to probabilistic / approximate results
✦ you can only predict the near future, cluster based on relatively short time periods and recognise short-term patterns and anomalies only
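The switch from exact to probabilistic / approximate results can be illustrated with a tiny Count-Min sketch: it estimates event frequencies in fixed memory, and may overcount on hash collisions but never undercounts. This is a toy version; real implementations derive width and depth from the desired error bounds:

```python
import hashlib

class CountMinSketch:
    """Fixed-memory frequency estimator: overcounts possible, undercounts never."""
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # the minimum over rows damps collision noise, but stays >= the true count
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for _ in range(5):
    cms.add("beep")
cms.add("meep")
print(cms.estimate("beep") >= 5)  # True - estimates never undercount
```

Memory stays constant no matter how many distinct events stream in - the trade the slide describes.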
31. Near realtime mining
✦ you mine live streams instead of passive data sources
✦ typical algorithms such as Apriori, one-class SVM, k-means, regressions etc. are easily possible, but on stream portions only
✦ NLP can be done by giving words identifiers and dealing with binary messages instead of text
✦ as long as it fits into main memory, it’s comparable to classic mining, but much faster
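Mining on a stream portion often means updating statistics incrementally instead of re-scanning raw data - for example Welford’s online algorithm for mean and variance (a generic sketch, not tied to any of the algorithms named above):

```python
class OnlineStats:
    """Welford's online mean / variance over a stream of numbers."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        # one pass, constant memory: no raw events are retained
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0

s = OnlineStats()
for x in [2.0, 4.0, 6.0]:
    s.update(x)
print(s.mean)  # 4.0
```

The same incremental pattern underlies streaming variants of k-means and regression: fold each event into a compact model state and discard the event.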
32. Near realtime + batch?
✦ the combination of both is what can make a winning solution. Example reference architecture: Lambda, but it’s even more
✦ exploratory, offline analytics, baseline analysis, pattern mining, algorithm training and the like you do in the batch
✦ you apply batch analytics’ results to near realtime and prove or reject hypotheses, detect anomalies, run forecasts, derive trends etc.
33. Near realtime, no batch?
✦ it’s possible to do some of this completely without batch, just on streams - even more than basic counters and stats
✦ you need to keep every single historic event in a data store
✦ you need to replay historic events instead of querying / mining your data store
✦ don’t query your database - let the database stream what it has to you
34. Near realtime example tools
✦ query/store-oriented / passively adapting: Spark/Shark, Impala, Drill, ParStream, Splunk
✦ full-blown CEP engines / continuous-querying DSMSs: Esper, TIBCO/StreamBase
✦ more pragmatic stream processors: Storm, S4, Samza
✦ event-oriented, continuous analysers: keen.io, also the speaker’s current WIP
✦ etc. etc. etc...
35. Near realtime - DIY
✦ in the end, you’ll have to build it (or core parts of it) yourself
✦ you’ll have to work with circular / ring buffers and / or zero-overhead queuing software: Disruptor, 0MQ
✦ ideally, you keep everything in one single OS process - multi-threading is still hairy enough then
✦ managing and using the machine’s overall memory is the tricky part
✦ for GPUs: OpenCL, Rootbeer
✦ embed analytics / statistics into the process
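The circular / ring buffer the Disruptor builds on can be sketched as a fixed-size, pre-allocated array with wrapping read / write cursors. This is a single-producer, single-consumer toy with no synchronisation, purely to show the mechanics:

```python
class RingBuffer:
    """Fixed-size ring buffer; a power-of-two size lets us mask instead of mod."""
    def __init__(self, size=8):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.buf = [None] * size   # pre-allocated once, never resized
        self.mask = size - 1
        self.write = 0
        self.read = 0

    def push(self, item):
        if self.write - self.read > self.mask:
            return False           # full: caller must retry or drop
        self.buf[self.write & self.mask] = item
        self.write += 1
        return True

    def pop(self):
        if self.read == self.write:
            return None            # empty
        item = self.buf[self.read & self.mask]
        self.read += 1
        return item

rb = RingBuffer(4)
for i in range(4):
    rb.push(i)
print(rb.push(99))  # False - buffer is full
print(rb.pop())     # 0
```

The pre-allocation and the monotonically increasing cursors (wrapped by masking) are what make such buffers allocation- and, in the real thing, lock-free.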
36. Near realtime - DIY
✦ picking the base platform has less to do with personal flavour than with what it offers
✦ C is a good and valid choice, but very “manual”
✦ Erlang/OTP is great for glue, but hard for analytics and integration. In the end, it’s C, but pretty tricky here
✦ Node.js is C in the end at this point, but it’s not for single-process / multi-threading and is still maturing
✦ JVM is a good compromise. Managed / GC-controlled memory with object wrappers will be sacrificed for off-heap memory with primitives though
✦ most of the rest doesn’t apply for this sort of task
37. Near realtime - DIY
✦ programming paradigms and thus languages are the essential, secret sauce
✦ functional programming is ideal for analytics and event processing
✦ (functional) reactive programming, Reactor (as pattern or framework) and Rx are good for building this sort of system
✦ JavaScript is partially there; Erlang, Clojure, Scala & Co. are further, but can be uncontrollable in runtime behaviour
✦ pure Java can (later) be a healthy trade-off though - now with Rx or Reactor, Netty etc.
38. Time in near realtime
✦ realtime still means real time, even if “near”
✦ the platform of your choice might not be ideal for hard or soft realtime, since the difference is primarily in what happens with late events and under high load
✦ Erlang will do its best to trigger a timer. Same with Node.js. But they don’t interrupt hard, schedule on their own and thus leave you with an approximation
✦ JVM comes close, but there’s still no easy way to interrupt explicitly. Alternative: hashed wheel timer, own scheduler on a dedicated core
✦ C is the winner; OS support is essential (RTOS and the like)
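The hashed wheel timer mentioned above maps each timeout into a slot on a circular array of buckets; a single cursor ticks around the wheel and fires whatever is due. A simplified, single-threaded sketch (real implementations like Netty’s run the cursor on its own thread):

```python
class HashedWheelTimer:
    """Timeouts hashed into wheel slots; one cursor tick per time unit."""
    def __init__(self, slots=8):
        self.slots = slots
        self.wheel = [[] for _ in range(slots)]
        self.tick = 0

    def schedule(self, delay_ticks, callback):
        # delay_ticks must be >= 1; 'rounds' counts full revolutions to wait
        slot = (self.tick + delay_ticks) % self.slots
        rounds = (delay_ticks - 1) // self.slots
        self.wheel[slot].append([rounds, callback])

    def advance(self):
        self.tick += 1
        bucket = self.wheel[self.tick % self.slots]
        due = [entry for entry in bucket if entry[0] == 0]
        bucket[:] = [[r - 1, cb] for r, cb in bucket if r > 0]
        for _, cb in due:
            cb()

fired = []
timer = HashedWheelTimer()
timer.schedule(3, lambda: fired.append("t+3"))
for _ in range(5):
    timer.advance()
print(fired)  # ['t+3']
```

Scheduling and expiry are O(1) per tick amortised, which is why this structure suits systems with huge numbers of coarse-grained timeouts.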
39. Near realtime + data store?
✦ near realtime analytics systems need to store data at different stages: short-term replay, disaster protection, history
✦ the trick is to turn around the way you work with the data store
✦ your data store knows the model and queries beforehand, and only waits for events to start streaming historic data satisfying the static query / view
✦ most NoSQL stores, but also classic RDBMSs, have embeddable workers / jobs / coprocessors as a built-in feature: Oracle, Riak, HBase etc.
40. Near realtime business cases
✦ anomaly / novelty / outlier detection in any sort of system
✦ fraud and attack detection based on patterns
✦ situational pricing, product placement
✦ stock, inventory control and forecast
✦ online bidding, trading
✦ automated traffic optimization
✦ semi-automated operations
✦ immediate visualization and tracing
41. Why speed?
✦ why be slow if it’s possible, with comparable effort, to be fast in making decisions and automating them? If not you, then your competitor will be
✦ since everybody can mine data, speed and quality are the only technical success factors left
✦ it’s about how fast you can decide based on data. The best way is to start very early, at the source of the data
✦ the “new economy” is all about speed, not (only) lobbies
42. ✦ cartoon images were found on the internet and are directly or indirectly the property / copyright of, or related to, Time Warner