Ad Networks act as middlemen between advertisers and publishers on the Internet. The advertiser is the agent that wants to place a particular ad across different media. The publisher is the agent who owns the media, usually web pages or mobile applications.
Each time an ad is shown on a web page or in a mobile application, an impression event is generated. These impressions and other events feed the analytical panels that the agents (advertisers and publishers) use to analyze the performance of their campaigns or their web pages.
Presenting these panels to the agents is a technical challenge, because Ad Networks have to deal with billions of events each day and have to present interactive panels to thousands of agents. The scale of the problem requires distributed tools. Hadoop may come to the rescue with its storage and computing capacity: it can be used to precompute several statistics that are later presented in the panels.
But that is not enough for the agents. In order to perform exploratory analytics, they need an interactive panel that allows them to filter down by a particular web page, country and device within a particular time frame, or by any other ad-hoc filter.
Therefore, something more than Hadoop is needed to store the data and perform the statistical precomputations. At Datasalt, we have addressed this problem for several clients, and we have found a solution that will be presented in this talk.
The solution consists of two modules: the off-line and the on-line.
Off-line
The off-line module is in charge of storing the received events and performing the most costly operations: cleaning the dataset, performing some aggregations in order to reduce the size of the data, and creating the file structures that will later be used to serve the on-line analytics. All these tasks are handled well by Hadoop. The most innovative part of this process is the last step, where file structures are created to be exported to the on-line module in order to serve the analytical panels.
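As a rough illustration of the aggregation step, here is a minimal MapReduce-style daily rollup in plain Python. The event fields and values are invented for the example; in production this would be a Hadoop job over billions of events:

```python
from collections import defaultdict

# Hypothetical impression events: (day, campaign_id, country) tuples.
events = [
    ("2013-05-01", "C1", "ES"),
    ("2013-05-01", "C1", "ES"),
    ("2013-05-01", "C2", "FR"),
    ("2013-05-02", "C1", "ES"),
]

# Map phase: emit one (key, 1) pair per event, keyed by (day, campaign).
# Reduce phase: sum the counts per key. In Hadoop these are separate
# distributed stages; here they are condensed into one dictionary pass.
daily_counts = defaultdict(int)
for day, campaign, country in events:
    daily_counts[(day, campaign)] += 1

print(dict(daily_counts))
# {('2013-05-01', 'C1'): 2, ('2013-05-01', 'C2'): 1, ('2013-05-02', 'C1'): 1}
```

The output is much smaller than the input, which is exactly why pre-aggregating off-line reduces the data that the on-line side has to handle.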
On-line
The on-line module is in charge of serving the analytical queries received from the agents' panel webapp. The queries are basic statistics (count, count distinct, stddev, sum, etc.) run over a subset of the input dataset defined by an ad-hoc filter. The challenge here is that the system has to serve statistics for filters "on the fly", which makes it impossible to precalculate everything on the off-line side. Therefore, part of the calculations must be done on demand. That would not be a problem if the scale of the data were not so big; some kind of scalable database is needed for this task.
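A minimal sketch of on-demand aggregation over an ad-hoc filter, using Python's built-in sqlite3 as a stand-in for the scalable serving database (the table and column names are invented for the example):

```python
import sqlite3

# In-memory table standing in for one agent's slice of the dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (day TEXT, country TEXT, device TEXT, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?, ?)",
    [
        ("2013-05-01", "ES", "mobile", 3),
        ("2013-05-01", "ES", "desktop", 5),
        ("2013-05-02", "FR", "mobile", 2),
    ],
)

# An ad-hoc filter chosen by the agent at query time: country + device +
# date range. Nothing about this combination was precomputed off-line.
row = conn.execute(
    "SELECT COUNT(*), SUM(clicks) FROM impressions "
    "WHERE country = ? AND device = ? AND day BETWEEN ? AND ?",
    ("ES", "mobile", "2013-05-01", "2013-05-02"),
).fetchone()
print(row)  # (1, 3)
```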
5. Ad Networks
- Principal agents
  - Advertiser
  - Publisher
    - Web pages
    - Mobile apps
- Ad Network
  - Network of agents that mediate between advertisers and publishers
  - DSPs, SSPs, DMPs, ATDs, ITDs, etc.
6. For the sake of simplicity...
- Let's consider a monolithic Ad Network
  - A single agent between advertisers and publishers
- But the exposed solution is also useful for DSPs, SSPs, DMPs, etc.
7. Need for analytics
- For advertisers
  - Monitoring campaigns
  - Improving ROI
- For publishers
  - Improving ad placement
- But there can be
  - Tens of thousands of advertisers
  - Hundreds of thousands of publishers
8. Analytics
- Counting impressions, clicks and CPC
  - For a given range of dates
  - Filtered by
    - Campaign
    - Location
    - Language
    - Browser/device
    - Ad type
    - ... or any combination of the above!
9. Two-fold usage
- Operational
  - For invoicing, accounting, etc.
  - Limited set of parameter variations
    - Fixed date ranges and common aggregations
  - Exact results expected
- Exploratory
  - Unlimited variations of parameters
    - Ad-hoc filtering
  - Approximate results are enough
10. Challenges
- Billions of events and hundreds of gigabytes per day
  - Need for a distributed system
- Query flexibility
  - Need to cope with operational and exploratory queries
- Web latencies
  - Queries must return in milliseconds
11. Exploding
- The data needed to serve analytics panels is Big Data
  - Thousands of advertiser panels
  - Even more publisher panels
- But individually, each agent panel can be served by one machine
  - At least for 98% of advertisers/publishers
  - Horizontal partitioning is a good strategy
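The partitioning idea can be sketched in a few lines of Python. This is a generic hash-based assignment; the later slides route by key ranges instead, and the partition count here is an arbitrary assumption:

```python
import hashlib

NUM_PARTITIONS = 4  # assumption for the sketch

def partition_for(agent_id: str) -> int:
    """Route all of an agent's rows to one partition, so that any
    panel query for that agent touches a single machine."""
    digest = hashlib.md5(agent_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Every event for the same agent deterministically lands in the
# same partition.
assert partition_for("U20") == partition_for("U20")
print({a: partition_for(a) for a in ["U20", "U21", "U40"]})
```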
14. Hadoop
- Scalable
  - Storage of raw data
  - Computing capabilities
- Good for
  - Creating pre-computed aggregations (views)
  - Generating samples of data
- Bad for
  - Serving data
  - On-line aggregations
15. Splout SQL
- Scalable
  - Serving of full SQL queries (unlike NoSQLs)
- Good for
  - Ad-hoc aggregations over pre-computed views
  - Serving low-latency web pages with concurrency
16. A well-balanced solution
- Hadoop
  - Provides a scalable repository for impressions
  - Performs off-line pre-aggregations and sampling
- Splout SQL
  - Serves queries
  - Performs on-line aggregations with sub-second latencies
    - Each partition contains only data for a few agents, which ensures performance
20. Generation

Generate tablespace T_ADVERTISERS with 2 partitions, for table ADVERTISERS partitioned by AID and table IMPRESSIONS partitioned by AID.

Source tables:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60
  U40  John            S223  U40  99

Partition U10–U35:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36–U60:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U40  John            S223  U40  99
21. API - Generation
- Command line: loading CSV files

  $ hadoop jar splout-*-hadoop.jar generate …

- Java API
- HCatalog (Hive, Pig)
22. Serving

For key = 'U20', tablespace = 'T_ADVERTISERS':

  SELECT Name, sum(Amount)
  FROM ADVERTISERS a, IMPRESSIONS i
  WHERE a.AID = i.AID AND a.AID = 'U20';

The query is routed to the partition holding key 'U20':

Partition U10–U35:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36–U60:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U40  John            S223  U40  99
23. Serving

For key = 'U40', tablespace = 'T_ADVERTISERS':

  SELECT Name, sum(Amount)
  FROM ADVERTISERS a, IMPRESSIONS i
  WHERE a.AID = i.AID AND a.AID = 'U40';

This time the query is routed to the partition holding key 'U40':

Partition U10–U35:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36–U60:

  ADVERTISERS          IMPRESSIONS
  AID  Name            PID   AID  Amount
  U40  John            S223  U40  99
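The serving examples above can be reproduced with Python's built-in sqlite3 standing in for a single partition. This sketch loads only the data of the partition holding keys U36–U60 and runs the query for key 'U40' locally:

```python
import sqlite3

# One partition's data (keys U36-U60), as in the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ADVERTISERS (AID TEXT, Name TEXT)")
conn.execute("CREATE TABLE IMPRESSIONS (PID TEXT, AID TEXT, Amount INTEGER)")
conn.execute("INSERT INTO ADVERTISERS VALUES ('U40', 'John')")
conn.execute("INSERT INTO IMPRESSIONS VALUES ('S223', 'U40', 99)")

# The query for key 'U40' is routed to this partition and can be
# answered entirely from local data.
row = conn.execute(
    "SELECT Name, SUM(Amount) FROM ADVERTISERS a, IMPRESSIONS i "
    "WHERE a.AID = i.AID AND a.AID = 'U40'"
).fetchone()
print(row)  # ('John', 99)
```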
27. Operational usage
- Invoicing, accounting, monitoring, etc.
  - Exact results
  - Constrained space of aggregations
- Pre-computed aggregates done in Hadoop
  - For example:
    - per day
    - per day, per location
- Extended aggregations done on-line
  - Using Splout SQL
  - For example, aggregating per week based on daily stats
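The "extended aggregations" idea, deriving weekly stats on-line from the precomputed daily stats, can be sketched in Python (the daily figures below are invented):

```python
from collections import defaultdict
from datetime import date

# Pre-computed daily impression counts, as produced off-line by Hadoop.
daily = {
    date(2013, 4, 29): 120,  # Monday, ISO week 18
    date(2013, 4, 30): 80,
    date(2013, 5, 6): 200,   # Monday, ISO week 19
}

# On-line extended aggregation: roll the daily counts up to ISO weeks.
weekly = defaultdict(int)
for day, impressions in daily.items():
    weekly[day.isocalendar()[1]] += impressions

print(dict(weekly))  # {18: 200, 19: 200}
```

Because the input is already aggregated per day, the on-line rollup touches a few hundred rows instead of billions of raw events.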
28. Why not pre-compute everything?
- Create one table per dimension combination
  - For two dimensions (day, location):
    - day
    - location
    - location, day
- For n dimensions
  - 2^n − 1 combinations
  - It explodes!
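The combinatorial explosion is easy to check: every non-empty subset of the n filter dimensions would need its own pre-computed table.

```python
def n_combinations(n_dimensions: int) -> int:
    """Number of non-empty dimension subsets: 2^n - 1."""
    return 2 ** n_dimensions - 1

# Two dimensions (day, location), as on the slide:
print(n_combinations(2))   # 3
# The five filter dimensions from slide 8:
print(n_combinations(5))   # 31
# A few more dimensions and it explodes:
print(n_combinations(10))  # 1023
```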
29. Exploratory usage
- Ad-hoc filters to learn from data
  - Approximate results are enough
- Intensive use of sampling
  - It can provide good accuracy with fast response
- Confidence interval:

  p ± z(α/2) · √( p(1 − p) / n )

  - p = proportion
  - n = sample size
  - z = normal distribution quantile
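The interval above is the standard normal approximation for a proportion. A small Python helper makes the numbers concrete (the proportion and sample size below are invented):

```python
from math import sqrt

def proportion_ci(p: float, n: int, z: float = 1.96) -> tuple:
    """Confidence interval for a sampled proportion:
    p +/- z(alpha/2) * sqrt(p * (1 - p) / n).
    z = 1.96 gives the usual 95% interval."""
    margin = z * sqrt(p * (1 - p) / n)
    return (p - margin, p + margin)

# E.g. a 2% click-through proportion measured on a 100,000-row sample:
low, high = proportion_ci(0.02, 100_000)
print(round(low, 5), round(high, 5))  # 0.01913 0.02087
```

Even a modest sample gives a narrow interval, which is why sampling works well for the exploratory panels.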
30. Samples
- Created on Hadoop
  - Different sample sets
    - For the last X days
    - For the last year
- Splout SQL for serving them
  - On-line analytics over samples
  - 1 million records per second* (44 bytes per row)
  - Faster with data in memory
    - Warming data prior to use
    - 2.7 million records per second*

* Measured on a laptop
31. Pre-aggregations: pros & cons
- Advantages
  - Exact results
  - Good for exploring the long tail
- Limitations
  - Only for a constrained number of aggregation combinations
  - Not good for exploratory analysis
32. Sampling: pros & cons
- Advantages
  - Fast filtering for any set of dimensions
  - Good accuracy for Top-N queries
- Limitations
  - Bad for narrow dimension filters
  - Bad for exploring the long tail
  - Approximate results
34. Conclusions
- Analytics in Ad Networks is a complex problem
  - Due to the amount of data
  - Due to the number of agents
- It can be solved using Hadoop + Splout SQL
  - By the use of partitioning
  - Using pre-aggregations
    - For operational usage
  - Using sampling
    - For exploratory profiles
35. Iván de Prado Alonso – CEO of Datasalt

www.datasalt.es
@ivanprado
@datasalt

Questions?