Ad Networks act as middlemen between advertisers and publishers on the Internet. The advertiser is the agent that wants to place a particular ad across different media. The publisher is the agent who owns the media, usually web pages or mobile applications.
Each time an ad is shown on a web page or in a mobile application, an impression event is generated. These impressions and other events feed the analytical panels that the agents (advertisers and publishers) use to analyze the performance of their campaigns or their web pages.
Presenting these panels to the agents is a technical challenge, because Ad Networks have to deal with billions of events each day and have to present interactive panels to thousands of agents. The scale of the problem requires distributed tools. Hadoop may come to the rescue with its storage and computing capacity: it can be used to precompute several statistics that are later presented in the panels.
But that is not enough for the agents. In order to perform exploratory analytics, they need an interactive panel that allows them to filter down by a particular web page, country and device within a particular time frame, or by any other ad-hoc filter.
Therefore, something more than Hadoop is needed to store the data and perform the statistical precomputations. At Datasalt, we have addressed this problem for several clients, and we have found a solution that will be presented in this talk.
The solution consists of two modules: the off-line and the on-line.
Off-line
The off-line module is in charge of storing the received events and performing the most costly operations: cleaning the dataset, performing some aggregations in order to reduce the size of the data, and creating the file structures that will later be used to serve the on-line analytics. All these tasks are handled well by Hadoop. The most innovative part of this process is the last step, where file structures are created to be exported to the on-line module in order to serve the analytical panels.
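As a rough illustration of the aggregation step, here is a minimal MapReduce-style daily rollup in plain Python. The event fields and values are invented for the example; in production this would be a Hadoop job over billions of events:

```python
from collections import defaultdict

# Hypothetical impression events: (day, campaign_id, country) tuples.
events = [
    ("2013-05-01", "C1", "ES"),
    ("2013-05-01", "C1", "ES"),
    ("2013-05-01", "C2", "FR"),
    ("2013-05-02", "C1", "ES"),
]

# Map phase: emit one (key, 1) pair per event, keyed by (day, campaign).
# Reduce phase: sum the counts per key. In Hadoop these are separate
# distributed stages; here they are condensed into one dictionary pass.
daily_counts = defaultdict(int)
for day, campaign, country in events:
    daily_counts[(day, campaign)] += 1

print(dict(daily_counts))
# {('2013-05-01', 'C1'): 2, ('2013-05-01', 'C2'): 1, ('2013-05-02', 'C1'): 1}
```

The output is much smaller than the input, which is exactly why pre-aggregating off-line reduces the data that the on-line side has to handle.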
On-line
The on-line module is in charge of serving the analytical queries received from the agents' panel webapp. The queries are basic statistics (count, count distinct, stddev, sum, etc.) run over a subset of the input dataset defined by an ad-hoc filter. The challenge here is that the system has to serve statistics for filters "on the fly", which makes it impossible to precalculate everything on the off-line side. Therefore, part of the calculations must be done on demand. That would not be a problem if the scale of the data were not so big; some kind of scalable database is needed for this task.
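A minimal sketch of on-demand aggregation over an ad-hoc filter, using Python's built-in sqlite3 as a stand-in for the scalable serving database (the table and column names are invented for the example):

```python
import sqlite3

# In-memory table standing in for one agent's slice of the dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (day TEXT, country TEXT, device TEXT, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?, ?)",
    [
        ("2013-05-01", "ES", "mobile", 3),
        ("2013-05-01", "ES", "desktop", 5),
        ("2013-05-02", "FR", "mobile", 2),
    ],
)

# An ad-hoc filter chosen by the agent at query time: country + device +
# date range. Nothing about this combination was precomputed off-line.
row = conn.execute(
    "SELECT COUNT(*), SUM(clicks) FROM impressions "
    "WHERE country = ? AND device = ? AND day BETWEEN ? AND ?",
    ("ES", "mobile", "2013-05-01", "2013-05-02"),
).fetchone()
print(row)  # (1, 3)
```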
5. Ad Networks
- Principal agents
  - Advertiser
  - Publisher
    - Web pages
    - Mobile apps
- Ad Network
  - Network of agents that mediate between advertisers and publishers
  - DSPs, SSPs, DMPs, ATDs, ITDs, etc.
6. For the sake of simplicity...
- Let's consider a monolithic Ad Network
  - A single agent between advertisers and publishers
- But the exposed solution is also useful for DSPs, SSPs, DMPs, etc.
7. Need for analytics
- For advertisers
  - Monitoring campaigns
  - Improving ROI
- For publishers
  - Improving ad placement
- But there can be
  - Tens of thousands of advertisers
  - Hundreds of thousands of publishers
8. Analytics
- Counting impressions, clicks and CPC
  - For a given range of dates
  - Filtered by
    - Campaign
    - Location
    - Language
    - Browser/device
    - Ad type
    - ... or any combination of the above!
9. Two-fold usage
- Operational
  - For invoicing, accounting, etc.
  - Limited set of parameter variations
    - Fixed date ranges and common aggregations
  - Exact results expected
- Exploratory
  - Unlimited variations of parameters
    - Ad-hoc filtering
  - Approximate results are enough
10. Challenges
- Billions of events and hundreds of gigabytes per day
  - Need for a distributed system
- Query flexibility
  - Need to cope with operational and exploratory queries
- Web latencies
  - Queries must return in milliseconds
11. Exploding
- The data needed to serve analytics panels is Big Data
  - Thousands of advertiser panels
  - Even more publisher panels
- But individually, each agent panel can be served by one machine
  - At least for 98% of advertisers/publishers
  - Horizontal partitioning is a good strategy
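The partitioning idea can be sketched in a few lines of Python. This is a generic hash-based assignment; the later slides route by key ranges instead, and the partition count here is an arbitrary assumption:

```python
import hashlib

NUM_PARTITIONS = 4  # assumption for the sketch

def partition_for(agent_id: str) -> int:
    """Route all of an agent's rows to one partition, so that any
    panel query for that agent touches a single machine."""
    digest = hashlib.md5(agent_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Every event for the same agent deterministically lands in the
# same partition.
assert partition_for("U20") == partition_for("U20")
print({a: partition_for(a) for a in ["U20", "U21", "U40"]})
```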
14. Hadoop
- Scalable
  - Storage of raw data
  - Computing capabilities
- Good for
  - Creating pre-computed aggregations (views)
  - Generating samples of data
- Bad for
  - Serving data
  - On-line aggregations
15. Splout SQL
- Scalable
  - Serving of full SQL queries (unlike NoSQLs)
- Good for
  - Ad-hoc aggregations over pre-computed views
  - Serving low-latency web pages with concurrency
16. A well-balanced solution
- Hadoop
  - Provides a scalable repository for impressions
  - Performs off-line pre-aggregations and sampling
- Splout SQL
  - Serves queries
  - Performs on-line aggregations with sub-second latencies
    - Each partition contains only data for a few agents, which ensures performance
20. Generation

Generate tablespace T_ADVERTISERS with 2 partitions, for table ADVERTISERS partitioned by AID and table IMPRESSIONS partitioned by AID.

Source tables:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60
  U40  John            S223  U40  99

Partition U10–U35:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36–U60:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U40  John            S223  U40  99
21. API - Generation
- Command line: loading CSV files

  $ hadoop jar splout-*-hadoop.jar generate …

- Java API
- HCatalog (Hive, Pig)
22. Serving

For key = 'U20', tablespace = 'T_ADVERTISERS':

  SELECT Name, sum(Amount)
  FROM ADVERTISERS a, IMPRESSIONS i
  WHERE a.AID = i.AID AND a.AID = 'U20';

The query is routed to the partition holding key 'U20':

Partition U10–U35:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36–U60:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U40  John            S223  U40  99
23. Serving

For key = 'U40', tablespace = 'T_ADVERTISERS':

  SELECT Name, sum(Amount)
  FROM ADVERTISERS a, IMPRESSIONS i
  WHERE a.AID = i.AID AND a.AID = 'U40';

This time the query is routed to the partition holding key 'U40':

Partition U10–U35:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36–U60:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U40  John            S223  U40  99
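The serving examples above can be reproduced with Python's built-in sqlite3 standing in for a single partition. This sketch loads only the data of the partition holding keys U36–U60 and runs the query for key 'U40' locally:

```python
import sqlite3

# One partition's data (keys U36-U60), as in the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ADVERTISERS (AID TEXT, Name TEXT)")
conn.execute("CREATE TABLE IMPRESSIONS (PID TEXT, AID TEXT, Amount INTEGER)")
conn.execute("INSERT INTO ADVERTISERS VALUES ('U40', 'John')")
conn.execute("INSERT INTO IMPRESSIONS VALUES ('S223', 'U40', 99)")

# The query for key 'U40' is routed to this partition and can be
# answered entirely from local data.
row = conn.execute(
    "SELECT Name, SUM(Amount) FROM ADVERTISERS a, IMPRESSIONS i "
    "WHERE a.AID = i.AID AND a.AID = 'U40'"
).fetchone()
print(row)  # ('John', 99)
```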
27. Operational usage
- Invoicing, accounting, monitoring, etc.
  - Exact results
  - Constrained space of aggregations
- Pre-computed aggregates done in Hadoop
  - For example:
    - per day
    - per day, per location
- Extended aggregations done on-line
  - Using Splout SQL
  - For example, aggregating per week based on daily stats
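The "extended aggregations" idea, deriving weekly stats on-line from the precomputed daily stats, can be sketched in Python (the daily figures below are invented):

```python
from collections import defaultdict
from datetime import date

# Pre-computed daily impression counts, as produced off-line by Hadoop.
daily = {
    date(2013, 4, 29): 120,  # Monday, ISO week 18
    date(2013, 4, 30): 80,
    date(2013, 5, 6): 200,   # Monday, ISO week 19
}

# On-line extended aggregation: roll the daily counts up to ISO weeks.
weekly = defaultdict(int)
for day, impressions in daily.items():
    weekly[day.isocalendar()[1]] += impressions

print(dict(weekly))  # {18: 200, 19: 200}
```

Because the input is already aggregated per day, the on-line rollup touches a few hundred rows instead of billions of raw events.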
28. Why not pre-compute everything?
- Create one table per dimension combination
  - For two dimensions (day, location):
    - day
    - location
    - location, day
- For n dimensions
  - 2^n − 1 combinations
  - It explodes!
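The combinatorial explosion is easy to check: every non-empty subset of the n filter dimensions would need its own pre-computed table.

```python
def n_combinations(n_dimensions: int) -> int:
    """Number of non-empty dimension subsets: 2^n - 1."""
    return 2 ** n_dimensions - 1

# Two dimensions (day, location), as on the slide:
print(n_combinations(2))   # 3
# The five filter dimensions from slide 8:
print(n_combinations(5))   # 31
# A few more dimensions and it explodes:
print(n_combinations(10))  # 1023
```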
29. Exploratory usage
- Ad-hoc filters to learn from data
  - Approximate results are enough
- Intensive use of sampling
  - It can provide good accuracy with fast response
- Confidence interval:

  p ± z(α/2) · √( p(1 − p) / n )

  - p = proportion
  - n = sample size
  - z = normal distribution quantile
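The interval above is the standard normal approximation for a proportion. A small Python helper makes the numbers concrete (the proportion and sample size below are invented):

```python
from math import sqrt

def proportion_ci(p: float, n: int, z: float = 1.96) -> tuple:
    """Confidence interval for a sampled proportion:
    p +/- z(alpha/2) * sqrt(p * (1 - p) / n).
    z = 1.96 gives the usual 95% interval."""
    margin = z * sqrt(p * (1 - p) / n)
    return (p - margin, p + margin)

# E.g. a 2% click-through proportion measured on a 100,000-row sample:
low, high = proportion_ci(0.02, 100_000)
print(round(low, 5), round(high, 5))  # 0.01913 0.02087
```

Even a modest sample gives a narrow interval, which is why sampling works well for the exploratory panels.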
30. Samples
- Created on Hadoop
  - Different sample sets
    - For the last X days
    - For the last year
- Splout SQL for serving them
  - On-line analytics over samples
  - 1 million records per second* (44 bytes per row)
  - Faster with data in memory
    - Warming data prior to use
    - 2.7 million records per second*

* Measured on a laptop
31. Pre-aggregations: pros & cons
- Advantages
  - Exact results
  - Good for exploring the long tail
- Limitations
  - Only for a constrained number of aggregation combinations
  - Not good for exploratory analysis
32. Sampling: pros & cons
- Advantages
  - Fast filtering for any set of dimensions
  - Good accuracy for Top-N queries
- Limitations
  - Bad for narrow dimension filters
  - Bad for exploring the long tail
  - Approximate results
34. Conclusions
- Analytics in Ad Networks is a complex problem
  - Due to the amount of data
  - Due to the number of agents
- It can be solved using Hadoop + Splout SQL
  - By the use of partitioning
  - Using pre-aggregations
    - For operational usage
  - Using sampling
    - For exploratory profiles
35. Iván de Prado Alonso – CEO of Datasalt

www.datasalt.es
@ivanprado
@datasalt

Questions?