Outside the Box: Alternate Query Models and the Future of Big Data

Grab some coffee and enjoy the pre-show banter before the top of the hour!

The Briefing Room

Welcome

Host: Eric Kavanagh
eric.kavanagh@bloorgroup.com

Twitter Tag: #briefr

Mission

§  Reveal the essential characteristics of enterprise software, good and bad
§  Provide a forum for detailed analysis of today's innovative technologies
§  Give vendors a chance to explain their product to savvy analysts
§  Allow audience members to pose serious questions... and get answers!


Topics

This Month: INNOVATORS
January: ANALYTICS
February: BIG DATA
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room


Data Discovery & Visualization

INNOVATORS

Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

robin.bloor@bloorgroup.com


Infobright
§  Infobright’s columnar database is used for applications and data marts that analyze large volumes of machine-generated data
§  It leverages patented compression and optimization techniques, and a “knowledge grid,” to achieve real-time analytics
§  Infobright offers a commercial version of its software, as well as a freely available, open-source product


Guests: Don DeLoach and Jeff Kibler
Don DeLoach is CEO and President of Infobright

Jeff Kibler is Senior Technical Architect for Infobright


Turning “Huh?” into “Aha!”

Alternate Query Models and Big Data Analytics

About Infobright
§  400+
direct
and
OEM
customers
across
North
America,
EMEA
and
Asia

§  1,000
installa:ons

§  8
of
Top
10
Global
Telecom
Carriers
use
Infobright
via
OEM/ISVs

Logis;cs,

Manufacturing,

Business

Intelligence

Online
&
Mobile
Adver;sing/Web

Analy;cs,
eCommerce,
Social
Networks

Government,

U;li;es,

Research

Financial
Services

Telecom,
Security

Core Competencies

Columnar Database
§  Designed for fast analytics
§  Deep data compression

Intelligence, not Hardware
§  Knowledge Grid
§  Iterative Engine

Administrative Simplicity
§  No manual tuning
§  Minimal ongoing administration

Machine-Generated Data Is Everywhere
§  Weblogs
§  Computer, network events
§  Call detail records
§  Financial trade data
§  Sensors, RFID
§  Online game data

Businesses need to extract insight in near-real time from rapidly growing data volume:
•  Segment and target website visitors
•  Troubleshoot networks
•  Identify security threats and fraud
•  Optimize online/mobile ads
Internet of Things is a Multiplier for EVERYTHING

Emerging Data Analytics Stack:
Days of One-Size-Fits-All Are Gone
“Yesterday’s BI-ETL-EDW stack is wrong-sided for tomorrow’s needs, and quickly becoming irrelevant.” Gigamon

§  Data management
§  Hadoop transforming this area
§  Transparent analytic stack
§  Operational, investigative, predictive
§  Machine-generated, text
§  User consumption
§  Real-time, interactive visualization & query creation
§  Data Center / Data Warehouse
§  Infrastructure strategies, options proliferating

Infobright: Columnar Architecture
Column Orientation

Knowledge Grid – statistics and metadata “describing” the super-compressed data

Data Packs – data stored in manageably sized, highly compressed data packs

Data compressed using algorithms tailored to data type

Smarter architecture
§  Load data and go
§  No indices or partitions to build and maintain
§  Knowledge Grid automatically updated as data packs are created or updated
§  Super-compact data footprint can leverage off-the-shelf hardware

The Knowledge Grid
Knowledge Grid: applies to the whole table
Knowledge Nodes: built for each Data Pack

[Diagram: Column A is divided into Data Packs DP1 to DP6; Column B and further columns follow the same layout. The Knowledge Grid holds information about the data.]

Global knowledge (built during LOAD)
§  String and character data
§  Numeric data
§  Distributions

Dynamic knowledge (built per query)
§  E.g. for aggregates, joins

§  Knowledge Nodes answer the query directly, or
§  Identify only required Data Packs, minimizing decompression, and
§  Predict required data in advance based on workload
Optimizer / Granular Engine
1. Query received
2. Engine iterates on Knowledge Grid
3. Each pass eliminates Data Packs
4. If any Data Packs are needed to resolve query, only those are decompressed

[Diagram: Query → Knowledge Grid → Results; Compressed Data; label “1%”. Example question: “How are my sales doing this year?”]
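The four steps above can be illustrated with a small sketch. It is a toy model, not Infobright's implementation: it assumes each Knowledge Node stores only the minimum, maximum and row count of its Data Pack, it uses a tiny pack size for readability, and the names `KnowledgeNode`, `build_packs`, `build_knowledge_nodes` and `count_greater_than` are hypothetical.

```python
from dataclasses import dataclass

PACK_SIZE = 4  # toy pack size chosen for readability (assumption)


@dataclass
class KnowledgeNode:
    """Per-pack metadata assumed for this sketch: min, max and row count."""
    lo: int
    hi: int
    count: int


def build_packs(values, pack_size=PACK_SIZE):
    """Split one column into fixed-size data packs."""
    return [values[i:i + pack_size] for i in range(0, len(values), pack_size)]


def build_knowledge_nodes(packs):
    """Compute the metadata once, at load time."""
    return [KnowledgeNode(min(p), max(p), len(p)) for p in packs]


def count_greater_than(packs, nodes, threshold):
    """COUNT(*) WHERE value > threshold. Returns (count, packs_opened)."""
    total = 0
    opened = 0
    for pack, node in zip(packs, nodes):
        if node.hi <= threshold:
            continue                 # no row can match: pack is eliminated
        if node.lo > threshold:
            total += node.count      # every row matches: answered from metadata
            continue
        opened += 1                  # only this pack has to be read
        total += sum(1 for v in pack if v > threshold)
    return total, opened


values = [1, 2, 3, 4, 10, 11, 12, 13, 5, 9, 2, 14]
packs = build_packs(values)
nodes = build_knowledge_nodes(packs)
result, opened = count_greater_than(packs, nodes, 8)
print(result, opened)  # 6 1
```

In this toy run, two of the three packs are resolved from metadata alone and only one is read, which is the effect the slide describes as decompressing only the packs that are needed.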

Infobright Architecture: Data Packs and Compression
Data Packs
§  Each data pack contains 65,536 (64K) data values
§  Compression is applied to each individual data pack
§  The compression algorithm varies depending on data type and distribution

Compression (Patent-Pending Compression Algorithms)
§  Results vary depending on the distribution of data among data packs
§  A typical overall compression ratio seen in the field is 10:1
§  Some customers have seen results of 40:1 and higher
§  For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity
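A rough sketch of per-pack compression and the disk arithmetic from the example above. The codec here is an assumption: `zlib` stands in for Infobright's own algorithms, so the ratio it prints says nothing about the product. The helper names `compress_pack` and `disk_needed_gb` are hypothetical, and 1 TB is taken as 1000 GB.

```python
import struct
import zlib

PACK_SIZE = 65_536  # values per data pack, as stated on the slide


def compress_pack(values):
    """Serialize one pack of 64-bit integers and compress it on its own.

    zlib is only a stand-in codec for this sketch.
    """
    raw = struct.pack(f"<{len(values)}q", *values)
    return raw, zlib.compress(raw)


def disk_needed_gb(raw_gb, ratio):
    """Disk capacity needed for raw_gb of data at a given compression ratio."""
    return raw_gb / ratio


# A low-cardinality column, loosely imitating machine-generated data.
pack = [i % 16 for i in range(PACK_SIZE)]
raw, packed = compress_pack(pack)
ratio = len(raw) / len(packed)
print(f"this pack: about {ratio:.0f}:1")

# The slide's example: 1 TB of raw data at 10:1.
print(disk_needed_gb(1000, 10))  # 100.0
```

Because each pack is compressed independently, the ratio depends on how values are distributed within that pack, which is why results vary from one data set to another.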

What Your Data Looks Like Now
Original data: 10TB
Compressed data: 50 GB (avg compression ratio of 20:1)
+ Knowledge Grid: < .5 GB (< 1% of compressed data)

Alternate Query Models: When Good Enough Works
§  “Principle of exactness” is the default for most data analytics and access systems today
§  Using “approximate queries,” good-enough answers can be found using fewer resources
§  Works best when users can easily alternate between approximation and exactness
§  Creates an interactivity that accelerates time to answers and reduces the computing resources required
Tools for Investigative Analysis

Today, Infobright provides:
§  Standard Queries: The Knowledge Grid is used to aid performance; only the required data packs are opened, and exact results are returned
§  Rough Queries: Only the Knowledge Grid is used to derive an answer quickly, typically for aggregates like SUM, AVG, MAX
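To make the idea of a rough query concrete, here is a minimal sketch that answers from metadata only and never reads a data pack. It assumes the Knowledge Grid keeps a per-pack minimum, maximum, sum and row count; that set of statistics, and the names `PackStats`, `rough_aggregates` and `rough_count_bounds`, are illustrative assumptions rather than Infobright's actual design.

```python
from dataclasses import dataclass


@dataclass
class PackStats:
    """Per-pack statistics assumed for this sketch."""
    lo: float
    hi: float
    total: float
    count: int


def summarize(pack):
    return PackStats(min(pack), max(pack), sum(pack), len(pack))


def rough_aggregates(stats):
    """Unfiltered SUM, AVG and MAX derived from metadata alone."""
    total = sum(s.total for s in stats)
    count = sum(s.count for s in stats)
    return {"SUM": total, "AVG": total / count, "MAX": max(s.hi for s in stats)}


def rough_count_bounds(stats, threshold):
    """Lower and upper bound on COUNT(*) WHERE value > threshold."""
    low = high = 0
    for s in stats:
        if s.hi <= threshold:
            continue              # no row in this pack can match
        if s.lo > threshold:
            low += s.count        # every row matches
            high += s.count
        else:
            low += 1              # at least the maximum matches
            high += s.count - 1   # at least the minimum does not
    return low, high


packs = [[1, 2, 3, 4], [10, 11, 12, 13], [5, 9, 2, 14]]
stats = [summarize(p) for p in packs]
agg = rough_aggregates(stats)
bounds = rough_count_bounds(stats, 8)
print(agg["SUM"], agg["MAX"], bounds)  # 86 14 (5, 7)
```

With a filter, the metadata yields a range rather than a single number; a standard query would then open only the mixed pack to turn the range (5, 7) into the exact count.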

Tools for Investigative Analysis

Fast and Informative:
§  Approximate Queries: Use a combination of the Knowledge Grid and Intelligent Random Sampling to return results very quickly; applicable to any type of query
§  Exact results are not important
§  Top-N type queries
§  Investigative Analytics

Use Case
§  Approximate Query is useful when looking for data in an exploratory fashion (e.g. anomalous events, understanding data characteristics)
§  Example: Find the “Top-10” protocols and ports extracted from event records.
§  Exact Query may take minutes; Approximate Query can answer in seconds. What’s important is the Top-10, not necessarily the exact numbers
EXACT QUERY

DY_HR  SUM(TDR)  AP_NAME
8      14269152  DNS
8      13716936  HTTP-80
8      13527636  HTTPS-443
8      13044432  UNDEFINED
8      11486904  NO APPL PORT
8       4280412  UNDEFINED
8       2313288  HTTP-ALT-8080
8       1278876  5223
8       1214100  DNS-53
8        991560  NO APPL PORT
8        899220  XMPP-Client

APPROXIMATE QUERY

DY_HR  SUM(TDR)  AP_NAME
8      16872663  HTTP-80
8      15361320  DNS
8      14528793  HTTPS-443
8      13578984  UNDEFINED
8      11613616  NO APPL PORT
8       3659742  UNDEFINED
8       2724149  HTTP-ALT-8080
8       1427824  5223
8       1194147  DNS-53
8       1083973  NO APPL PORT
8        967579  XMPP-Client
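The comparison above can be imitated with a small sampling sketch. It is not Infobright's method: plain uniform random sampling stands in for the "Intelligent Random Sampling" mentioned earlier, the event records are synthetic (only the protocol labels are borrowed from the slide), and `top_n_exact` and `top_n_approx` are hypothetical names.

```python
import random
from collections import Counter


def top_n_exact(records, n):
    """Scan every record and sum TDR per name."""
    totals = Counter()
    for name, tdr in records:
        totals[name] += tdr
    return totals.most_common(n)


def top_n_approx(records, n, fraction, rng):
    """Sum TDR over a uniform random sample, then scale up to the full size."""
    k = max(1, int(len(records) * fraction))
    sample = rng.sample(records, k)
    scale = len(records) / k
    totals = Counter()
    for name, tdr in sample:
        totals[name] += tdr
    return [(name, round(total * scale)) for name, total in totals.most_common(n)]


# Synthetic, skewed event records (assumption: not the data behind the slide).
rng = random.Random(42)
names = ["HTTP-80", "DNS", "HTTPS-443", "XMPP-Client", "HTTP-ALT-8080"]
weights = [16, 8, 4, 2, 1]
records = [
    (rng.choices(names, weights=weights)[0], rng.randint(50, 150))
    for _ in range(50_000)
]

exact = top_n_exact(records, 3)
approx = top_n_approx(records, 3, fraction=0.1, rng=rng)
print("exact: ", exact)
print("approx:", approx)
```

The sample touches a tenth of the records, so the sums differ from the exact ones while the ranking of well-separated groups is preserved. When two groups are close, their order can swap, as DNS and HTTP-80 do between the two result sets above.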

Example: Online Advertising Segmentation

The goal in this example is to create a targeted campaign. They have a minimum number of participants that have to be included in the target group.

Traditional Queries
§  Find the top n individuals who meet criteria 1
§  Then find the top m individuals who meet criteria 1 and criteria 2
§  This is repeated until they are in the range that they want to work with. There can be up to 1500 different criteria, though they normally stop after 7 or 8 different filters
§  They also have to look at how many individuals are in each permutation of the criteria
§  This process can take a considerable amount of time

Approximate Queries
§  Approximate queries could dramatically reduce the time it takes to determine which set of criteria they should use
§  They can (if desired) use exact queries to calculate the exact final numbers, instead of having to run exact queries for all the runs
§  This process can collapse an effort that takes hours into minutes or seconds
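The workflow in this example can be sketched as a loop that uses cheap estimates while criteria are being tried and a single exact count at the end. Everything here is illustrative: the audience data is synthetic, uniform sampling stands in for an approximate query, and `estimate_count`, `exact_count` and `narrow_audience` are hypothetical names.

```python
import random


def estimate_count(sample, filters, scale):
    """Approximate audience size: count matches in the sample and scale up."""
    return round(scale * sum(1 for p in sample if all(f(p) for f in filters)))


def exact_count(people, filters):
    """Exact audience size: scan everyone."""
    return sum(1 for p in people if all(f(p) for f in filters))


def narrow_audience(people, candidate_filters, min_size, max_size, rng, fraction=0.1):
    """Add filters one at a time until the estimate lands in the target range."""
    k = max(1, int(len(people) * fraction))
    sample = rng.sample(people, k)
    scale = len(people) / k
    chosen = []
    for f in candidate_filters:
        estimate = estimate_count(sample, chosen + [f], scale)
        if estimate < min_size:
            break                    # this filter would shrink the group too far
        chosen.append(f)
        if estimate <= max_size:
            break                    # estimate is inside the target range
    # One exact query at the end, instead of one per iteration.
    return chosen, exact_count(people, chosen)


# Synthetic audience of 20,000 people (assumption for illustration).
people = [
    {"age": 18 + i % 50, "mobile": i % 2 == 0, "region": i % 5}
    for i in range(20_000)
]
filters = [
    lambda p: p["age"] < 40,
    lambda p: p["mobile"],
    lambda p: p["region"] == 0,
]

chosen, final_size = narrow_audience(
    people, filters, min_size=1_000, max_size=6_000, rng=random.Random(7)
)
print(len(chosen), final_size)
```

Only the last step scans the full data set; every intermediate decision is made on the sample, which is where the time saving described above would come from.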

Big Data Analytics At the End of the Day

AD HOC PERFORMANCE

SCALABILITY

LOAD SPEEDS

HIGH AVAILABILITY

LOW TOUCH

COMPRESSION

TCO

AFFORDABILITY

Perceptions & Questions

Analyst: Robin Bloor


The Current Disposition

§  10 bn connected devices
§  13 to 14 bn new processors embedded every year
§  Estimate 31 bn connected devices by 2020
§  Sensors, RFID tags, DSPs, FPGAs, CPUs, etc.
§  To control, alert, log and report
§  Data growth at 55% pa

IOT Data Characteristics
§  Arrives in continuous streams
§  Generally reliable (i.e., not in need of cleansing)
§  Very high volume
§  “Big tables” of predictably structured data
§  So, very little need for ETL activity
§  If “valuable” then processing speed is likely to be critical

IOT Apps and Database
§  Mostly streaming – for alerts and BI (analysis, discovery)
§  DBMS choice is a “horses for courses” thing
§  If performance matters, probably not a Hadoop app
§  The data structure does not favor the prominent NoSQL DBMSs
§  Traditional RDBMS will not do well
§  Hence column-store approach is most logical

The Coming Inversion
1. Instrument existing (dumb) devices
2. Gather and analyze data
3. Redesign device and its instrumentation from knowledge gained
4. Iterate

Going Forward

In terms of DATA VOLUMES we expect the IOT DATA VOLUME to swamp all other sources of data

u  Do

the high compression rates you achieve occur
because it is machine data, i.e., it’s a function of
the characteristics of the data?

u  Is

the “approximate query” an Infobright
invention?

u  How

frequently do customers use this type of
query and for what type of applications?

u  Who,

typically, are the Infobright end users?

u  What

“relationship” does Infobright favor with
Hadoop?

u  What

statistical functions, if any, does Infobright
offer?

u  What

does the product roadmap look like?


Upcoming Topics

This Month: INNOVATORS
January: ANALYTICS
February: BIG DATA
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

www.insideanalysis.com


Thank You for Your Attention

