The document compares and contrasts the characteristics of Wile E. Coyote and Road Runner from the Looney Tunes cartoons. It describes Coyote as slow, offline, having a wide field of vision, long memory, being proactive, thorough, and always losing to Road Runner. Road Runner is described as fast, always running, having a narrow field of vision, short memory, being reactive, spontaneous, and always winning against Coyote. The document then discusses the differences between batch and near real-time analytics and technologies.
3. Wile E. Coyote
✦ pretty slow
✦ running on own demand
✦ very wide field of vision
✦ very long memory
✦ purely proactive
✦ thoroughly analysing and preparing
✦ always loses
4. Road Runner
✦ hell fast
✦ ever running
✦ very narrow field of vision
✦ very short memory
✦ purely reactive
✦ forced to immediately decide
✦ always wins
5. Coyote: slow
✦ too much mumbo-jumbo, too many tools, totally dependent on ACME
✦ needs a complex, partially distributed setup
✦ complex decisions, depending on Runner, weather, environment etc.
6. Runner: fast
✦ zero hoo-ha, zero tools, just his own body
✦ road-bound
✦ simple decisions like run | halt | step aside | beep beep
8. Runner: non-stop
✦ never stops fully, just occasionally halts for food and to fool Coyote
✦ continuously runs the road in search of food
9. Coyote: wide vision
✦ sees the whole environment
✦ tries to use the whole environment to catch Runner, predicting his paths
10. Runner: narrow vision
✦ only sees what’s in front of his nose on the road
✦ thanks to his speed and short-term predictions, copes well with the narrow, momentary vision
11. Coyote: long memory
✦ as far as possible, learns from previous failures
✦ continuously improves tricks to catch Runner
16. Runner: spontaneous
✦ decides immediately and spontaneously, depending on what Coyote does
✦ makes the best immediate decision to achieve the highest level of Coyote fooling
17. Coyote: loses
✦ no matter how hard he tries, he’s never fast or savvy enough to catch Runner
✦ never gives up though
18. Runner: wins
✦ doesn’t even try to win, but always does, thanks to speed and immediate situation analysis followed by reaction. Also due to Coyote’s continuous failure
✦ has fun fooling Coyote every time
20. Batch (analytics)
✦ is when you have plenty of time for analysis
✦ is when you explore patterns and models in historic data
✦ is when you try to fit any sort of data into a hypothetical model
✦ is when you plan and forecast the future instead of (re)acting immediately
21. Batch (architecture)
✦ is when you (synchronously) query previously stored data
✦ is when you use main memory primarily for temporary caches
✦ is when you do ETL and the like, even on Hadoop’s rails
✦ is when you split large amounts of historic data into smaller portions for distributed / parallel analysis
22. Batch (technology)
✦ is when you build on (R)DBMSs or (soft-schema) NoSQL data stores in a classic way
✦ is when you store in HDFS and process with Hadoop & Co.
✦ is when you generally rely on disks / storage
23. Near realtime (analytics)
✦ is when you don’t have time
✦ is when you analyse data as it comes
✦ is when you already have a fixed model, and the data flying in fits it 100%
✦ is when you (re)act immediately, based on patterns you learned online and in the batch analysis
24. Near realtime (architecture)
✦ is when you don’t query data, but expect / assume it
✦ is when you use main memory as the primary data storage
✦ is when you process event streams
✦ is when you distribute and parallelise only independent computations (it’s hairy enough even on one machine - explicit loop tiling, skewing etc.)
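The “process event streams, don’t query” idea can be sketched in a few lines: events are pushed into a handler and all state lives in main memory, so nothing is ever pulled from a store. The names here are purely illustrative, not from any specific framework:

```python
class StreamProcessor:
    """Keeps per-key running counts in main memory; no data store is queried."""
    def __init__(self):
        self.counts = {}

    def on_event(self, event):
        # events are pushed in; the processor never pulls / queries
        key = event["key"]
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

proc = StreamProcessor()
stream = [{"key": "beep"}, {"key": "beep"}, {"key": "meep"}]
results = [proc.on_event(e) for e in stream]
print(results)  # [1, 2, 1]
```

The point is the inversion of control: the data arrives at the computation, instead of the computation going to the data.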
25. Near realtime (technology)
✦ is when you build on DSMSs, event processing systems and the like
✦ is when you store (almost) only for archiving reasons
✦ is when you don’t hit disks or speak of “storage”
✦ is when you do your best to avoid horizontal network gossip
✦ is when you must go for accelerators such as GPUs in case of complex math
26. Near realtime - non-stop, immediate analytics cannot be done as / in batch.
27. Near realtime is tricky
✦ you need to build event-driven, non-blocking, lock-free, reactive programs (buzzword award!)
✦ you need to work time-bound, penalising or compensating late events
✦ you need to keep everything (sliced, auto-expiring) in main memory
✦ you need to fully utilise the resources of one single machine (speaking of mechanical sympathy), without waste
✦ you need to fix your model and work with fixed-size (binary) events
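The “sliced, auto-expiring main memory” point can be sketched as a time-based window that silently drops events older than its horizon. This is a toy illustration, not any specific product’s API:

```python
from collections import deque

class ExpiringWindow:
    """Keeps only events newer than `horizon` time units; older ones auto-expire."""
    def __init__(self, horizon):
        self.horizon = horizon
        self.events = deque()  # (timestamp, value); timestamps arrive in order

    def add(self, ts, value):
        self.events.append((ts, value))
        self._expire(ts)

    def _expire(self, now):
        # drop everything that fell out of the window
        while self.events and self.events[0][0] <= now - self.horizon:
            self.events.popleft()

    def values(self):
        return [v for _, v in self.events]

w = ExpiringWindow(horizon=10)
w.add(1, "a")
w.add(5, "b")
w.add(12, "c")        # ts=1 is now older than the horizon and expires
print(w.values())     # ['b', 'c']
```

Expiry here piggybacks on every insert; a real system would also expire on a timer, so the window shrinks even when no new events arrive.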
28. Scaling near realtime
✦ scaling near realtime analytics is pretty hard. The challenges are similar whether parallelising on one machine or scaling out in a distributed way
✦ you scale through logical or physical stream splitting, online scatter-gather and the like
✦ you keep distributed / parallel computation independent, until you have to merge in the next processing stage. And so on.
✦ you scale through receive-and-forward, fire-and-forget, cascading, pipelining, multicast, redundant (who’s first, role-based etc.) processing
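Logical stream splitting plus scatter-gather can be sketched as hash-partitioning events by key, so each partition is processed independently until a later merge stage (an illustrative toy, not any particular framework’s API):

```python
def split_stream(events, n_partitions):
    """Hash-partition events by key; each partition can be processed independently."""
    partitions = [[] for _ in range(n_partitions)]
    for event in events:
        partitions[hash(event["key"]) % n_partitions].append(event)
    return partitions

def count_partition(partition):
    """Independent per-partition computation (the 'scatter' stage)."""
    counts = {}
    for event in partition:
        counts[event["key"]] = counts.get(event["key"], 0) + 1
    return counts

def merge(partials):
    """The 'gather' stage: combine independent partial results."""
    total = {}
    for part in partials:
        for k, v in part.items():
            total[k] = total.get(k, 0) + v
    return total

events = [{"key": k} for k in ["a", "b", "a", "c", "b", "a"]]
partials = [count_partition(p) for p in split_stream(events, 3)]
print(sorted(merge(partials).items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

Because all events for one key land in one partition, the partial computations never need to coordinate - exactly the “keep computation independent until the merge” point above.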
29. Surviving near realtime
✦ building a restlessly event-oriented, in-memory analytics system brings some challenges
✦ disaster recovery: yet again, splitting streams (for storage), redundant (role-based) computation
✦ short-term failure recovery: up-front temporary, auto-expiring storage, auto-replay or penalising events
30. Near realtime is limited
✦ you need to run most analytics on event windows of some size
✦ you switch from exact to probabilistic / approximate results
✦ you can only predict the near future, cluster based on relatively short time periods and recognise short-term patterns and anomalies only
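The switch from exact to probabilistic / approximate results can be illustrated with a tiny Count-Min sketch: it estimates event frequencies in fixed memory, and may overcount on hash collisions but never undercounts. This is a toy version; real implementations derive width and depth from the desired error bounds:

```python
import hashlib

class CountMinSketch:
    """Fixed-memory frequency estimator: overcounts possible, undercounts never."""
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # the minimum over rows damps collision noise, but stays >= the true count
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for _ in range(5):
    cms.add("beep")
cms.add("meep")
print(cms.estimate("beep") >= 5)  # True - estimates never undercount
```

Memory stays constant no matter how many distinct events stream in - the trade the slide describes.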
31. Near realtime mining
✦ you mine live streams instead of passive data sources
✦ typical algorithms such as Apriori, one-class SVM, k-means, regressions etc. are easily possible, but on stream portions only
✦ NLP can be done by giving words identifiers and dealing with binary messages instead of text
✦ as long as it fits into main memory, it’s comparable to classic mining, but much faster
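Mining on a stream portion often means updating statistics incrementally instead of re-scanning raw data - for example Welford’s online algorithm for mean and variance (a generic sketch, not tied to any of the algorithms named above):

```python
class OnlineStats:
    """Welford's online mean / variance over a stream of numbers."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        # one pass, constant memory: no raw events are retained
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0

s = OnlineStats()
for x in [2.0, 4.0, 6.0]:
    s.update(x)
print(s.mean)  # 4.0
```

The same incremental pattern underlies streaming variants of k-means and regression: fold each event into a compact model state and discard the event.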
32. Near realtime + batch?
✦ the combination of both is what can make a winning solution. Example reference architecture: Lambda, but it’s even more
✦ exploratory, offline analytics, baseline analysis, pattern mining, algorithm training and the like you do in the batch
✦ you apply batch analytics’ results to near realtime and prove or reject hypotheses, detect anomalies, run forecasts, derive trends etc.
33. Near realtime, no batch?
✦ it’s possible to do some of this completely without batch, just on streams - even more than basic counters and stats
✦ you need to keep every single historic event in a data store
✦ you need to replay historic events instead of querying / mining your data store
✦ don’t query your database - let the database stream what it has to you
34. Near realtime example tools
✦ query/store-oriented / passively adapting: Spark/Shark, Impala, Drill, ParStream, Splunk
✦ full-blown CEP engines / continuous-querying DSMSs: Esper, TIBCO/StreamBase
✦ more pragmatic stream processors: Storm, S4, Samza
✦ event-oriented, continuous analysers: keen.io, also the speaker’s current WIP
✦ etc. etc. etc...
35. Near realtime - DIY
✦ in the end, you’ll have to build it (or core parts of it) yourself
✦ you’ll have to work with circular / ring buffers and / or zero-overhead queuing software: Disruptor, 0MQ
✦ ideally, you keep everything in one single OS process - multi-threading is still hairy enough then
✦ managing and using the machine’s overall memory is the tricky part
✦ for GPUs: OpenCL, Rootbeer
✦ embed analytics / statistics into the process
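The circular / ring buffer the Disruptor builds on can be sketched as a fixed-size, pre-allocated array with wrapping read / write cursors. This is a single-producer, single-consumer toy with no synchronisation, purely to show the mechanics:

```python
class RingBuffer:
    """Fixed-size ring buffer; a power-of-two size lets us mask instead of mod."""
    def __init__(self, size=8):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.buf = [None] * size   # pre-allocated once, never resized
        self.mask = size - 1
        self.write = 0
        self.read = 0

    def push(self, item):
        if self.write - self.read > self.mask:
            return False           # full: caller must retry or drop
        self.buf[self.write & self.mask] = item
        self.write += 1
        return True

    def pop(self):
        if self.read == self.write:
            return None            # empty
        item = self.buf[self.read & self.mask]
        self.read += 1
        return item

rb = RingBuffer(4)
for i in range(4):
    rb.push(i)
print(rb.push(99))  # False - buffer is full
print(rb.pop())     # 0
```

The pre-allocation and the monotonically increasing cursors (wrapped by masking) are what make such buffers allocation- and, in the real thing, lock-free.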
36. Near realtime - DIY
✦ picking the base platform has less to do with personal flavour than with what it offers
✦ C is a good and valid choice, but very “manual”
✦ Erlang/OTP is great for glue, but hard for analytics and integration. In the end, it’s C, but pretty tricky here
✦ Node.js is C in the end at this point, but it’s not for single-process / multi-threading and is still maturing
✦ JVM is a good compromise. Managed / GC-controlled memory with object wrappers will be sacrificed for off-heap memory with primitives though
✦ most of the rest doesn’t apply for this sort of task
37. Near realtime - DIY
✦ programming paradigms and thus languages are the essential, secret sauce
✦ functional programming is ideal for analytics and event processing
✦ (functional) reactive programming, Reactor (as pattern or framework) and Rx are good for building this sort of system
✦ JavaScript is partially there; Erlang, Clojure, Scala & Co. are further, but can be uncontrollable in runtime behaviour
✦ pure Java can (later) be a healthy trade-off though - now with Rx or Reactor, Netty etc.
38. Time in near realtime
✦ realtime still means real time, even if “near”
✦ the platform of your choice might not be ideal for hard or soft realtime, since the difference is primarily in what happens with late events and under high load
✦ Erlang will do its best to trigger a timer. Same with Node.js. But they don’t interrupt hard, schedule on their own and thus leave you with an approximation
✦ JVM comes close, but there’s still no easy way to interrupt explicitly. Alternative: hashed wheel timer, own scheduler on a dedicated core
✦ C is the winner; OS support is essential (RTOS and the like)
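The hashed wheel timer mentioned above maps each timeout into a slot on a circular array of buckets; a single cursor ticks around the wheel and fires whatever is due. A simplified, single-threaded sketch (real implementations like Netty’s run the cursor on its own thread):

```python
class HashedWheelTimer:
    """Timeouts hashed into wheel slots; one cursor tick per time unit."""
    def __init__(self, slots=8):
        self.slots = slots
        self.wheel = [[] for _ in range(slots)]
        self.tick = 0

    def schedule(self, delay_ticks, callback):
        # delay_ticks must be >= 1; 'rounds' counts full revolutions to wait
        slot = (self.tick + delay_ticks) % self.slots
        rounds = (delay_ticks - 1) // self.slots
        self.wheel[slot].append([rounds, callback])

    def advance(self):
        self.tick += 1
        bucket = self.wheel[self.tick % self.slots]
        due = [entry for entry in bucket if entry[0] == 0]
        bucket[:] = [[r - 1, cb] for r, cb in bucket if r > 0]
        for _, cb in due:
            cb()

fired = []
timer = HashedWheelTimer()
timer.schedule(3, lambda: fired.append("t+3"))
for _ in range(5):
    timer.advance()
print(fired)  # ['t+3']
```

Scheduling and expiry are O(1) per tick amortised, which is why this structure suits systems with huge numbers of coarse-grained timeouts.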
39. Near realtime + data store?
✦ near realtime analytics systems need to store data at different stages: short-term replay, disaster protection, history
✦ the trick is to turn around the way you work with the data store
✦ your data store knows the model and queries beforehand, and only waits for events to start streaming historic data satisfying the static query / view
✦ most NoSQL stores, but also classic RDBMSs, have embeddable workers / jobs / coprocessors as a built-in feature: Oracle, Riak, HBase etc.
40. Near realtime business cases
✦ anomaly / novelty / outlier detection in any sort of system
✦ fraud and attack detection based on patterns
✦ situational pricing, product placement
✦ stock, inventory control and forecast
✦ online bidding, trading
✦ automated traffic optimization
✦ semi-automated operations
✦ immediate visualization and tracing
41. Why speed?
✦ why be slow if it’s possible, with comparable effort, to be fast in making decisions and automating them? If not you, then your competitor will be
✦ since everybody can mine data, speed and quality are the only technical success factors left
✦ it’s about how fast you can decide based on data. The best way is to start very early, at the source of the data
✦ the “new economy” is all about speed, not (only) lobbies
42. ✦ cartoon images were found on the internet and are directly or indirectly the property / copyright of, or related to, Time Warner