Iván de Prado Alonso, CEO of Datasalt
www.datasalt.es
@ivanprado
@datasalt

Ad Networks analytics using
Hadoop and Splout SQL
Big Data consulting &
training
Agenda
1. Analytics for Ad Networks
2. Our solution
   1. Hadoop + Splout SQL
   2. Splout SQL in detail
   3. Pre-aggregations vs. sampling
3. Conclusions

Analytics for Ad Networks

Ad Networks
▪ Principal agents
  › Advertiser
  › Publisher
    • Web pages
    • Mobile apps
▪ Ad Network
  › Network of agents that mediate between advertisers and publishers
  › DSPs, SSPs, DMPs, ADTs, ITDs, etc.

For the sake of simplicity...
▪ Let's consider a monolithic Ad Network
  › Single agent between advertisers and publishers
▪ But the solution presented here is also useful for DSPs, SSPs, DMPs, etc.

Need for analytics
▪ For advertisers
  › Monitoring campaigns
  › Improving ROI
▪ For publishers
  › Improving ad placement
▪ But there can be
  › Tens of thousands of advertisers
  › Hundreds of thousands of publishers

Analytics
▪ Counting impressions, clicks and CPC
  › For a given range of dates
  › Filtered by
    • Campaign
    • Location
    • Language
    • Browser/device
    • Ad type
    • ... or any combination of the above!
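
The kind of query described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names and the `summarize` helper are assumptions made for the example, and CPC is taken here to mean total cost divided by total clicks.

```python
from datetime import date

# Hypothetical daily rows; the field names are illustrative, not from the talk.
AD_STATS = [
    {"day": date(2013, 11, 4), "campaign": "c1", "location": "ES",
     "impressions": 1000, "clicks": 20, "cost": 5.0},
    {"day": date(2013, 11, 5), "campaign": "c1", "location": "US",
     "impressions": 2000, "clicks": 20, "cost": 3.0},
    {"day": date(2013, 11, 6), "campaign": "c2", "location": "ES",
     "impressions": 500, "clicks": 5, "cost": 1.0},
]


def summarize(rows, start, end, **filters):
    """Sum impressions and clicks for rows within [start, end] that match
    every dimension filter, and derive CPC as cost / clicks."""
    selected = [
        row for row in rows
        if start <= row["day"] <= end
        and all(row.get(dim) == value for dim, value in filters.items())
    ]
    impressions = sum(row["impressions"] for row in selected)
    clicks = sum(row["clicks"] for row in selected)
    cost = sum(row["cost"] for row in selected)
    # CPC is undefined when there are no clicks.
    cpc = cost / clicks if clicks else None
    return {"impressions": impressions, "clicks": clicks, "cpc": cpc}
```

Any combination of dimensions can be passed as keyword arguments, for example `summarize(AD_STATS, date(2013, 11, 4), date(2013, 11, 6), campaign="c2", location="ES")`.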
  
Two-fold usage
▪ Operational
  › For invoicing, accounting, etc.
  › Limited set of parameter variations
    • Fixed date ranges and common aggregations
  › Exact results expected
▪ Exploratory
  › Unlimited variations of parameters
    • Ad-hoc filtering
  › Approximate results are enough

Challenges
▪ Billions of events and hundreds of gigabytes per day
  › Need for a distributed system
▪ Query flexibility
  › Need to cope with operational and exploratory queries
▪ Web latencies
  › Queries must return in milliseconds

Exploding
▪ Data needed to serve analytics panels is Big Data
  › Thousands of advertiser panels
  › Even more for publisher panels
▪ But individually each agent panel can be served with one machine
  › At least for 98% of advertisers/publishers
  › Horizontal partitioning is a good strategy

Our solution

Hadoop
▪ Scalable
  › Storage of raw data
  › Computing capabilities
▪ Good for
  › Creating pre-computed aggregations (views)
  › Generating samples of data
▪ Bad for
  › Serving data
  › On-line aggregations

Splout SQL
▪ Scalable
  › Serving of full SQL queries (unlike NoSQLs)
▪ Good for
  › Ad-hoc aggregations over pre-computed views
  › Serving low-latency web pages with concurrency

A well-balanced solution
▪ Hadoop
  › Provides a scalable repository for impressions
  › Performs off-line pre-aggregations and sampling
▪ Splout SQL
  › Serves queries
  › Performs on-line aggregations with sub-second latencies
    • Each partition contains only data for a few agents, which ensures performance

Splout SQL (in detail)

Splout SQL in detail
Isolation between generation and serving

Splout SQL Architecture

Generation
Generate tablespace T_ADVERTISERS with 2 partitions for
  table ADVERTISERS, partitioned by AID
  table IMPRESSIONS, partitioned by AID

Tablespace T_ADVERTISERS
  ADVERTISERS
    AID   Name
    U20   Doug
    U21   Ted
    U40   John
  IMPRESSIONS
    PID    AID   Amount
    S100   U20   102
    S101   U20   60
    S223   U40   99

Partition U10 – U35
  ADVERTISERS
    AID   Name
    U20   Doug
    U21   Ted
  IMPRESSIONS
    PID    AID   Amount
    S100   U20   102
    S101   U20   60

Partition U36 – U60
  ADVERTISERS
    AID   Name
    U40   John
  IMPRESSIONS
    PID    AID   Amount
    S223   U40   99
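
The range partitioning shown above can be sketched as a key lookup. The partition bounds and rows come from the slide; the `partition_for` and `split_rows` helpers are hypothetical names used only for this illustration and are not part of the Splout SQL API.

```python
import bisect

# Partition ranges from the slide: (lowest key, highest key), both inclusive.
PARTITIONS = [("U10", "U35"), ("U36", "U60")]
_UPPER_BOUNDS = [high for _, high in PARTITIONS]

# Rows of the IMPRESSIONS table from the slide: (PID, AID, Amount).
IMPRESSIONS = [("S100", "U20", 102), ("S101", "U20", 60), ("S223", "U40", 99)]


def partition_for(key):
    """Return the index of the partition whose key range contains `key`.

    Keys are compared as strings, which works here because every key has
    the same length.
    """
    index = bisect.bisect_left(_UPPER_BOUNDS, key)
    if index == len(PARTITIONS) or key < PARTITIONS[index][0]:
        raise KeyError(f"no partition holds key {key!r}")
    return index


def split_rows(rows, key_index):
    """Group rows by partition, using the column at `key_index` as the key."""
    parts = {i: [] for i in range(len(PARTITIONS))}
    for row in rows:
        parts[partition_for(row[key_index])].append(row)
    return parts
```

Because both tables are partitioned by the same key, all rows for one advertiser land in the same partition, so a join for that advertiser never has to cross partitions.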
  
API - Generation
Command line
  Loading CSV files
  $ hadoop jar splout-*-hadoop.jar generate …
Java API
HCatalog
Hive
Pig

Serving
For key = 'U20', tablespace = 'T_ADVERTISERS'

SELECT Name, sum(Amount) FROM
ADVERTISERS a, IMPRESSIONS i WHERE
a.AID = i.AID AND a.AID = 'U20';

Partition U10 – U35 (holds key 'U20')
  ADVERTISERS
    AID   Name
    U20   Doug
    U21   Ted
  IMPRESSIONS
    PID    AID   Amount
    S100   U20   102
    S101   U20   60

Partition U36 – U60
  ADVERTISERS
    AID   Name
    U40   John
  IMPRESSIONS
    PID    AID   Amount
    S223   U40   99
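
To make the serving step concrete, the sketch below loads the rows of partition U10 – U35 into an in-memory SQLite database and runs the query for key 'U20'. SQLite from the Python standard library is used here only as a convenient stand-in for a single partition; the `build_partition` and `total_for` helpers are assumptions of this example, not the Splout SQL API.

```python
import sqlite3


def build_partition(advertisers, impressions):
    """Create an in-memory database holding the rows of one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ADVERTISERS (AID TEXT PRIMARY KEY, Name TEXT)")
    conn.execute("CREATE TABLE IMPRESSIONS (PID TEXT, AID TEXT, Amount INTEGER)")
    conn.executemany("INSERT INTO ADVERTISERS VALUES (?, ?)", advertisers)
    conn.executemany("INSERT INTO IMPRESSIONS VALUES (?, ?, ?)", impressions)
    return conn


# Rows of partition U10 - U35, as shown on the slide.
partition_u10_u35 = build_partition(
    [("U20", "Doug"), ("U21", "Ted")],
    [("S100", "U20", 102), ("S101", "U20", 60)],
)

# The column in the filter is qualified (a.AID) because both tables have AID.
QUERY = """
SELECT a.Name, SUM(i.Amount)
FROM ADVERTISERS a JOIN IMPRESSIONS i ON a.AID = i.AID
WHERE a.AID = ?
GROUP BY a.Name
"""


def total_for(conn, aid):
    """Return (Name, total Amount) for one advertiser, or None if no rows match."""
    return conn.execute(QUERY, (aid,)).fetchone()
```

Only the partition that holds the key is touched, so the join and the aggregation run over the data of a few advertisers rather than the whole dataset.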
  
Serving
For key = 'U40', tablespace = 'T_ADVERTISERS'

SELECT Name, sum(Amount) FROM
ADVERTISERS a, IMPRESSIONS i WHERE
a.AID = i.AID AND a.AID = 'U40';

Partition U10 – U35
  ADVERTISERS
    AID   Name
    U20   Doug
    U21   Ted
  IMPRESSIONS
    PID    AID   Amount
    S100   U20   102
    S101   U20   60

Partition U36 – U60 (holds key 'U40')
  ADVERTISERS
    AID   Name
    U40   John
  IMPRESSIONS
    PID    AID   Amount
    S223   U40   99

API - Service
REST API
JSON response

API - Console

Pre-aggregations vs. Sampling

Operational usage
▪ Invoicing, accounting, monitoring, etc.
  › Exact results
  › Constrained space of aggregations
▪ Pre-computed aggregates done in Hadoop
  › For example:
    • per day
    • per day per location
▪ Extended aggregations done on-line
  › Using Splout SQL
  › For example, aggregate per week based on daily stats
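
The weekly roll-up mentioned above can be illustrated as follows. The daily figures are invented for the example, and grouping by ISO week is an assumption; the talk does not say how weeks are defined.

```python
from collections import defaultdict
from datetime import date

# Hypothetical pre-computed daily stats: (day, impressions).
DAILY_STATS = [
    (date(2013, 11, 4), 120),   # Monday
    (date(2013, 11, 5), 80),
    (date(2013, 11, 10), 50),   # Sunday of the same ISO week
    (date(2013, 11, 11), 70),   # Monday of the following ISO week
]


def weekly_totals(daily_stats):
    """Aggregate daily stats into totals keyed by (ISO year, ISO week)."""
    totals = defaultdict(int)
    for day, impressions in daily_stats:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += impressions
    return dict(totals)
```

The on-line step only has to read a handful of daily rows per week, which is why it can be left to the serving layer instead of being pre-computed as well.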
  
Why not pre-compute everything?
▪ Create one table per dimension combination
  › For two dimensions (day, location):
    • day
    • location
    • location, day
▪ For n dimensions
  › 2^n - 1 combinations
  › It explodes!
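
The count above is the number of non-empty subsets of the dimensions. A short sketch that enumerates them (the helper name is illustrative):

```python
from itertools import combinations


def dimension_combinations(dimensions):
    """List every non-empty combination of dimensions, one per table needed."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


two = dimension_combinations(["day", "location"])
five = dimension_combinations(
    ["day", "location", "language", "device", "ad_type"]
)
```

With two dimensions there are 3 tables; with the five filter dimensions used in this example there are already 31, and each extra dimension roughly doubles the number.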
  
Exploratory usage
▪ Ad-hoc filters to learn from data
  › Approximate results are enough
▪ Intensive use of sampling
  › It can provide good accuracy with fast response
▪ Confidence interval
  $$p \pm z_{\alpha/2} \sqrt{\frac{p\,(1 - p)}{n}}$$
  › p = proportion
  › n = sample size
  › z = quantile of the standard normal distribution
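
As a rough illustration of the confidence interval for a proportion, the sketch below evaluates it with the Python standard library. The function name and the default 95% confidence level are choices made for this example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p, n, confidence=0.95):
    """Normal-approximation interval p +/- z * sqrt(p * (1 - p) / n)."""
    alpha = 1.0 - confidence
    # z is the (1 - alpha/2) quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sqrt(p * (1.0 - p) / n)
    return p - margin, p + margin


# A 2% click-through proportion measured on samples of two different sizes.
large_sample = proportion_confidence_interval(0.02, 1_000_000)
small_sample = proportion_confidence_interval(0.02, 10_000)
```

The margin shrinks with the square root of the sample size, so a sample 100 times smaller gives an interval 10 times wider. This is the reason narrow filters, which leave few sampled rows, produce less reliable estimates.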
Samples
▪ Created on Hadoop
  › Different sample sets
    • For last X days
    • For last year
▪ Splout SQL for serving them
    • On-line analytics over samples
    • 1 million records per second* (44 bytes per row)
    • Faster with data in memory
      ✓ Warming data prior to use
      ✓ 2.7 million records per second*
* Measured on a laptop
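
A minimal sketch of answering a count query from a sample, assuming uniform random sampling at a fixed rate. The talk does not describe how the samples are drawn, so the sampling scheme, the synthetic events and the `estimate_count` helper are all assumptions of this example.

```python
import random

# Synthetic events: exactly 30% of them have location "ES".
ALL_EVENTS = [
    {"location": "ES" if i % 10 < 3 else "US"} for i in range(100_000)
]


def estimate_count(events, predicate, rate, seed=42):
    """Estimate how many events match `predicate` from a uniform sample.

    Each event is kept with probability `rate`; the number of matches in
    the sample is then scaled up by 1 / rate.
    """
    rng = random.Random(seed)
    sample = [event for event in events if rng.random() < rate]
    matches = sum(1 for event in sample if predicate(event))
    return matches / rate


estimated = estimate_count(ALL_EVENTS, lambda e: e["location"] == "ES", rate=0.1)
```

The estimate is close to the true count of 30,000 while scanning only about a tenth of the rows. Any filter can be applied to the sample, at the cost of an approximate answer.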
  
Pre-aggregations pros & cons
▪ Advantages
  › Exact results
  › Good for exploring the long tail
▪ Limitations
  › Only for a constrained number of aggregation combinations
  › Not good for exploratory analysis

Sampling pros & cons
▪ Advantages
  › Fast filtering for any set of dimensions
  › Good accuracy for Top N queries
▪ Limitations
  › Bad for narrow dimension filters
  › Bad for exploring the long tail
  › Approximate results

Conclusions
▪ Analytics in Ad Networks is a complex question
  › Due to the amount of data
  › Due to the number of agents
▪ It can be solved using Hadoop + Splout SQL
  › By the use of partitioning
  › Using pre-aggregations
    • For operational usage
  › Using sampling
    • For exploratory profiles

Iván de Prado Alonso, CEO of Datasalt
www.datasalt.es
@ivanprado
@datasalt


Questions?

Big data spain 2013 - ad networks analytics
