The Briefing Room with Dr. Robin Bloor and Infobright
Live Webcast Dec. 17
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7950017&rKey=9b6b134099af5b46
How big a role will Big Data play in the future of analytics? There’s no question that all flavors of Big Data are here to stay, especially the rising waters of machine-generated data. Cramming all the details you want into a giant data warehouse will no longer be tenable, which means other, more federated solutions must arise. That’s where alternate query models will save the day.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains how the widening landscape of Big Data will continue to transform the manner in which analytics are done. He’ll be briefed by Don DeLoach of Infobright, who will tout his company’s Big Data strategy, which focuses on greatly expediting the process of doing analysis on large sets of federated data.
Visit InsideAnalysis.com for more information
4. Mission
§ Reveal the essential characteristics of enterprise software, good and bad
§ Provide a forum for detailed analysis of today’s innovative technologies
§ Give vendors a chance to explain their product to savvy analysts
§ Allow audience members to pose serious questions... and get answers!
Twitter Tag: #briefr
The Briefing Room
5. Topics
This Month: INNOVATORS
January: ANALYTICS
February: BIG DATA
2014 Editorial Calendar at
www.insideanalysis.com/webcasts/the-briefing-room
6. Data Discovery & Visualization
INNOVATORS
7. Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
robin.bloor@bloorgroup.com
8. Infobright
§ Infobright’s columnar database is used for applications and data marts that analyze large volumes of machine-generated data
§ It leverages patented compression and optimization techniques, and a “knowledge grid,” to achieve real-time analytics
§ Infobright offers a commercial version of its software, as well as a freely available, open source product
9. Guests: Don DeLoach and Jeff Kibler
Don DeLoach is CEO and President of Infobright
Jeff Kibler is Senior Technical Architect for Infobright
11. About Infobright
§ 400+ direct and OEM customers across North America, EMEA and Asia
§ 1,000 installations
§ 8 of Top 10 Global Telecom Carriers use Infobright via OEM/ISVs
Logistics, Manufacturing, Business Intelligence
Online & Mobile Advertising/Web Analytics, eCommerce, Social Networks
Government, Utilities, Research
Financial Services
Telecom, Security
12. Core Competencies
Columnar Database
§ Designed for fast analytics
§ Deep data compression
Intelligence, not Hardware
§ Knowledge Grid
§ Iterative Engine
Administrative Simplicity
§ No manual tuning
§ Minimal ongoing administration
13. Machine-Generated Data Is Everywhere
§ Weblogs
§ Computer, network events
§ Call detail records
§ Financial trade data
§ Sensors, RFID
§ Online game data
Businesses need to extract insight in near-real time from rapidly growing data volume:
• Segment and target website visitors
• Troubleshoot networks
• Identify security threats and fraud
• Optimize online/mobile ads
15. Emerging Data Analytics Stack: Days of One-Size-Fits-All Are Gone
“Yesterday’s BI-ETL-EDW stack is wrong-sided for tomorrow’s needs, and quickly becoming irrelevant.” Gigamon
§ Data management
§ Hadoop transforming this area
§ Transparent analytic stack
§ Operational, investigative, predictive
§ Machine-generated, text
§ User consumption
§ Real-time, interactive visualization & query creation
§ Data Center / Data Warehouse
§ Infrastructure strategies, options proliferating
16. Infobright: Columnar Architecture
Column Orientation
Knowledge Grid – statistics and metadata “describing” the super-compressed data
Data Packs – data stored in manageably sized, highly compressed data packs
Data compressed using algorithms tailored to data type
Smarter architecture
§ Load data and go
§ No indices or partitions to build and maintain
§ Knowledge Grid automatically updated as data packs are created or updated
§ Super-compact data footprint can leverage off-the-shelf hardware
17. The Knowledge Grid
Knowledge Grid: applies to the whole table
Knowledge Nodes: built for each Data Pack
§ Knowledge Nodes answer the query directly, or
§ Identify only required Data Packs, minimizing decompression, and
§ Predict required data in advance based on workload
Information about the data
[Diagram: Column A divided into Data Packs DP1 to DP6; Column B …]
Global knowledge: built during LOAD
§ String and character data
§ Numeric data
§ Distributions
Dynamic knowledge: built per query, e.g. for aggregates, joins
18. Optimizer / Granular Engine
1. Query received
2. Engine iterates on Knowledge Grid
3. Each pass eliminates Data Packs
4. If any Data Packs are needed to resolve query, only those are decompressed
[Diagram labels: Query, Knowledge Grid, Compressed Data, 1%, Results. Q: How are my sales doing this year?]
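The four steps above can be illustrated with a minimal Python sketch. It is not Infobright's engine: the `KnowledgeNode` class, the min/max/count summary it stores, and the `count_greater_than` function are all hypothetical names chosen for illustration, and the packs are tiny lists rather than compressed blocks. The point is only that a per-pack summary lets a query skip packs that cannot match, resolve packs that fully match without opening them, and open only the remainder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KnowledgeNode:
    """Hypothetical per-pack summary (an assumption, not Infobright's format)."""
    count: int
    lo: int
    hi: int


def build_nodes(packs: List[List[int]]) -> List[KnowledgeNode]:
    """Build one summary per data pack, as would happen at load time."""
    return [KnowledgeNode(len(p), min(p), max(p)) for p in packs]


def count_greater_than(
    packs: List[List[int]], nodes: List[KnowledgeNode], threshold: int
) -> Tuple[int, int]:
    """Return (matching row count, number of packs that had to be opened)."""
    total = 0
    opened = 0
    for pack, node in zip(packs, nodes):
        if node.hi <= threshold:
            continue                 # irrelevant pack: no value can match
        if node.lo > threshold:
            total += node.count      # fully relevant: answered from the summary
            continue
        opened += 1                  # suspect pack: only these are scanned
        total += sum(1 for v in pack if v > threshold)
    return total, opened


# Illustrative data: three tiny "packs" of one column.
packs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
nodes = build_nodes(packs)
print(count_greater_than(packs, nodes, 6))  # (6, 1)
```

In this toy run only the middle pack is opened; the first is eliminated and the third is counted from its summary alone.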
19. Infobright Architecture: Data Packs and Compression
Data Packs
§ Each data pack contains 65,536 data values
§ Compression is applied to each individual data pack
§ The compression algorithm varies depending on data type and distribution
Compression
§ Results vary depending on the distribution of data among data packs
§ A typical overall compression ratio seen in the field is 10:1
§ Some customers have seen results of 40:1 and higher
§ For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity
[Diagram: four 64K data packs, each labeled Patent-Pending Compression Algorithms]
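The arithmetic on this slide can be checked with a short Python sketch. The function names are hypothetical, and the calculation assumes decimal units (1TB = 1000GB), which is what the slide's own 1TB to 100GB example implies.

```python
import math

VALUES_PER_PACK = 65_536  # from the slide: each data pack holds 65,536 values


def packs_needed(row_count: int) -> int:
    """Number of data packs one column needs for the given number of rows."""
    return math.ceil(row_count / VALUES_PER_PACK)


def compressed_gb(raw_gb: float, ratio: float) -> float:
    """Disk footprint in GB for raw_gb of data at a ratio:1 compression ratio."""
    return raw_gb / ratio


print(packs_needed(1_000_000))   # 16 packs per column for one million rows
print(compressed_gb(1000, 10))   # 100.0 (the slide's 1TB at 10:1 example)
print(compressed_gb(1000, 40))   # 25.0 (the same 1TB at the 40:1 high end)
```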
20. What Your Data Looks Like Now
Original data: 10TB
Compressed data: 50 GB = Avg compression ratio of 20:1
+ Knowledge Grid: < .5 GB (< 1% of compressed data)
21. Alternate Query Models: When Good Enough Works
§ The “principle of exactness” is the default for most data analytics and access systems today
§ With “approximate queries,” good-enough answers can be found using fewer resources
§ This works best when users can easily alternate between approximation and exactness
§ The result is an interactivity that accelerates time to answers and reduces the computing resources required
22. Tools for Investigative Analysis
Today, Infobright provides:
§ Standard Queries: Knowledge Grid is used to
aid performance, only required data packs are
opened, retrieves exact results
§ Rough Queries: Only Knowledge Grid is used
to derive an answer quickly, typically for
analytics like SUM, AVG, MAX
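As a rough illustration of answering from summaries alone, the Python sketch below keeps a count, min, max and sum per pack and never touches the underlying values at query time. The `PackSummary` layout and both function names are assumptions made for this example, not Infobright's Rough Query syntax or semantics. An unfiltered MAX comes out exact; a filtered SUM comes out as a lower and upper bound, under the stated assumption that all values are non-negative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PackSummary:
    """Hypothetical per-pack statistics kept in place of the raw values."""
    count: int
    lo: float
    hi: float
    total: float


def summarize(values: List[float]) -> PackSummary:
    return PackSummary(len(values), min(values), max(values), sum(values))


def rough_max(summaries: List[PackSummary]) -> float:
    """MAX over the column, exact, using only pack-level maxima."""
    return max(s.hi for s in summaries)


def rough_sum_where_at_least(
    summaries: List[PackSummary], threshold: float
) -> Tuple[float, float]:
    """Bounds on SUM(v) WHERE v >= threshold, assuming non-negative values."""
    lower = upper = 0.0
    for s in summaries:
        if s.hi < threshold:
            continue             # no value in this pack qualifies
        if s.lo >= threshold:
            lower += s.total     # every value qualifies
            upper += s.total
        else:
            upper += s.total     # some qualify: between 0 and the pack total
    return lower, upper


summaries = [summarize(p) for p in [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
print(rough_max(summaries))                    # 9
print(rough_sum_where_at_least(summaries, 5))  # (24.0, 39.0); the exact sum is 35
```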
23. Tools for Investigative Analysis
Fast and Informative:
§ Approximate Queries: Use a combination of the Knowledge Grid and Intelligent Random Sampling to return results very quickly; applicable to any type of query
§ Exact results are not important
§ Top-N type queries
§ Investigative Analytics
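A Top-N query answered from a sample can be sketched in a few lines of Python. This uses plain uniform random sampling as a stand-in; it is not Infobright's Intelligent Random Sampling and does not use a Knowledge Grid. The function names, the 10% sampling fraction, and the synthetic rows are all assumptions for illustration.

```python
import random
from collections import Counter
from typing import List, Tuple

Row = Tuple[str, int]


def exact_top_n(rows: List[Row], n: int) -> List[Tuple[str, int]]:
    """Scan every row and return the n keys with the largest summed value."""
    totals: Counter = Counter()
    for key, value in rows:
        totals[key] += value
    return totals.most_common(n)


def approximate_top_n(
    rows: List[Row], n: int, fraction: float, seed: int = 0
) -> List[Tuple[str, float]]:
    """Aggregate a uniform random sample, then scale the sums back up."""
    rng = random.Random(seed)
    totals: Counter = Counter()
    for key, value in rows:
        if rng.random() < fraction:   # keep roughly `fraction` of the rows
            totals[key] += value
    return [(key, total / fraction) for key, total in totals.most_common(n)]


# Synthetic rows (not taken from any real event log).
rows = [("HTTP-80", 5)] * 6000 + [("DNS", 5)] * 3000 + [("XMPP-Client", 5)] * 1000
random.Random(42).shuffle(rows)

print(exact_top_n(rows, 3))  # [('HTTP-80', 30000), ('DNS', 15000), ('XMPP-Client', 5000)]
print(approximate_top_n(rows, 3, fraction=0.1))
```

The approximate sums differ from the exact ones, but with groups this far apart the ranking is recovered from about a tenth of the rows, which is the trade-off the slide describes.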
24. Use Case
§ Approximate Query useful when looking for data in an exploratory fashion
(e.g. anomalous events, understanding data characteristics)
§ Example: Find the “Top-10” protocols and ports extracted from event records.
§ An Exact Query may take minutes; an Approximate Query can answer in seconds. What’s important is the Top-10, not necessarily the exact numbers
EXACT QUERY
DY_HR   SUM(TDR)   AP_NAME
8       14269152   DNS
8       13716936   HTTP-80
8       13527636   HTTPS-443
8       13044432   UNDEFINED
8       11486904   NO APPL PORT
8       4280412    UNDEFINED
8       2313288    HTTP-ALT-8080
8       1278876    5223
8       1214100    DNS-53
8       991560     NO APPL PORT
8       899220     XMPP-Client
APPROXIMATE QUERY
DY_HR   SUM(TDR)   AP_NAME
8       16872663   HTTP-80
8       15361320   DNS
8       14528793   HTTPS-443
8       13578984   UNDEFINED
8       11613616   NO APPL PORT
8       3659742    UNDEFINED
8       2724149    HTTP-ALT-8080
8       1427824    5223
8       1194147    DNS-53
8       1083973    NO APPL PORT
8       967579     XMPP-Client
25. Example: Online Advertising Segmentation
The goal in this example is to create a targeted campaign. They have a minimum number of participants that have to be included in the target group.
Traditional Queries
§ Find the top n individuals who meet criterion 1
§ Then find the top m individuals who meet criterion 1 and criterion 2
§ This is repeated until they are in the range that they want to work with. There can be up to 1500 different criteria, though they normally stop after 7 or 8 different filters
§ They also have to look at how many individuals are in each permutation of the criteria
§ This process can take a considerable amount of time
Approximate Queries
§ An approximate query could dramatically reduce the time it takes to determine which set of criteria they should use
§ They can (if desired) use exact queries to calculate the exact final numbers, instead of having to do exact queries for all the runs
This process can collapse an effort that takes hours into minutes or seconds
26. Big Data Analytics At the End of the Day
AD HOC
PERFORMANCE
SCALABILITY
LOAD SPEEDS
HIGH AVAILABILITY
LOW TOUCH
COMPRESSION
TCO
AFFORDABILITY
30. The Current Disposition
§ 10 bn connected devices
§ 13 to 14 bn new processors embedded every year
§ Estimate 31 bn connected devices by 2020
§ Sensors, RFID tags, DSPs, FPGAs, CPUs, etc.
§ To control, alert, log and report
§ Data growth at 55% pa
31. IOT Data Characteristics
§ Arrives in continuous streams
§ Generally reliable (i.e., not in need of cleansing)
§ Very high volume
§ “Big tables” of predictably structured data
§ So, very little need for ETL activity
§ If “valuable” then processing speed is likely to be critical
32. IOT Apps and Database
§ Mostly streaming – for alerts and BI (analysis, discovery)
§ DBMS choice is a “horses for courses” thing
§ If performance matters, probably not a Hadoop app
§ The data structure does not favor the prominent NoSQL DBMSs
§ Traditional RDBMS will not do well
§ Hence column-store approach is most logical
33. The Coming Inversion
1. Instrument existing (dumb) devices
2. Gather and analyze data
3. Redesign device and its instrumentation from knowledge gained
4. Iterate
34. Going Forward
In terms of DATA VOLUMES we expect the IOT DATA VOLUME to swamp all other sources of data
35.
§ Do the high compression rates you achieve occur because it is machine data, i.e., it’s a function of the characteristics of the data?
§ Is the “approximate query” an Infobright invention?
§ How frequently do customers use this type of query and for what type of applications?
§ Who, typically, are the Infobright end users?
36.
§ What “relationship” does Infobright favor with Hadoop?
§ What statistical functions, if any, does Infobright offer?
§ What does the product roadmap look like?
38. Upcoming Topics
This Month: INNOVATORS
January: ANALYTICS
February: BIG DATA
2014 Editorial Calendar at
www.insideanalysis.com/webcasts/the-briefing-room
www.insideanalysis.com