Building Audience Analytics Platform 
Jothi Padmanabhan 
Inmobi 
6-Sep-2014
Motivation 
➔Audience Analytics platform is extremely critical 
➔Segmentation 
➔Rule Based 
➔Inferred based on Sciences Modeling 
➔Third Party 
➔Targeting 
➔Maximize CTR and CVR
Challenges 
➔Scale 
➔Billions of Ad requests/day, Peak 25K rps, 800M Users
➔ Multiple Input Sources and Types 
➔Fact Data, Dimension Data 
➔ Multiple Consumers 
➔Reporting, Segmentation and Targeting, Inferences
Challenges 
➔ Data Curation 
➔Define and Measure Data Quality 
➔Track sources and possibly assign 
confidence 
➔Governance and Licensing restrictions 
➔ Consistent Querying Interface
Challenges 
● Storage capacity and retention 
● Optimal usage of grid resources
Activity Data 
➔Records actual activity 
➔Time-series data 
➔ Immutable, actual facts 
➔Comprises Dimensions and Measures 
➔Measures 
➔Ad requests, Impressions, Clicks, Conversion, ...
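
As a rough illustration of immutable facts made up of dimensions and measures, the sketch below models one activity event as a frozen dataclass and counts measures per dimension value. The `ActivityEvent` fields and the `rollup` helper are hypothetical names picked for this example; the slides do not give a schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a recorded fact cannot be modified afterwards
class ActivityEvent:
    ts: int        # event time in epoch seconds (the time-series key)
    uid: str       # dimension: user id
    country: str   # dimension: coarse location
    measure: str   # "request", "impression", "click" or "conversion"


def rollup(events, dimension):
    """Count each measure per value of the chosen dimension."""
    totals = Counter()
    for event in events:
        totals[(getattr(event, dimension), event.measure)] += 1
    return totals


events = [
    ActivityEvent(1409961600, "u1", "IN", "impression"),
    ActivityEvent(1409961605, "u1", "IN", "click"),
    ActivityEvent(1409961700, "u2", "US", "impression"),
]
by_country = rollup(events, "country")
```

Here `by_country[("IN", "click")]` holds the number of clicks recorded for country `IN`; any other dimension field can be passed to `rollup` in the same way.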
Dimension Data 
➔Domain specific Metadata (user, location, app 
etc) 
➔Each domain will have its own schema 
➔User (uid, age, gender, interests etc) 
➔Location (Lat/Long – zip/city/country, etc) 
➔Device (Handset model, OS, version etc) 
➔Mutable (but possibly slowly changing)
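
One common way to handle mutable but slowly changing dimensions is to keep every version with the time it became valid, so a fact can be joined against the attributes that applied when it happened. The sketch below assumes that approach; the slides do not say which strategy was used, and `DimensionHistory` is a hypothetical name.

```python
class DimensionHistory:
    """Keeps all versions of a dimension row, ordered by validity start."""

    def __init__(self):
        self._versions = {}  # key -> list of (valid_from, attrs), sorted

    def upsert(self, key, valid_from, attrs):
        versions = self._versions.setdefault(key, [])
        versions.append((valid_from, dict(attrs)))
        versions.sort(key=lambda version: version[0])

    def as_of(self, key, ts):
        """Return the attributes in effect at time ts, or None."""
        current = None
        for valid_from, attrs in self._versions.get(key, []):
            if valid_from > ts:
                break
            current = attrs
        return current


devices = DimensionHistory()
devices.upsert("u1", 100, {"os": "Android", "version": "4.3"})
devices.upsert("u1", 200, {"os": "Android", "version": "4.4"})
```

A click recorded at time 150 would then resolve to version 4.3 through `devices.as_of("u1", 150)`, even after the device was upgraded.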
ETL 
➔Need to ingest data from different 
sources 
➔Transform the data into a format optimized 
for storage and easy querying 
➔Query interface for different consumers
ETL - Ingestion 
➔Naive -- Have custom ingestion flows 
➔Quick to develop 
➔Could be highly optimized 
➔Not scalable 
➔Have a generic framework 
➔Streamlined and scalable 
➔Might need more processing
ETL - Storage 
➔Naive -- Storage schema closely coupled 
with ingestion schema 
➔Multiple representations of same data. Age 
could be DOB or years 
➔Consistent representation is a must 
➔Would require transformation from input 
schema to storage schema
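
The age example above can be sketched as a small input-to-storage transformation. This is only an assumption about how such a step might look: the canonical form chosen here (birth year), the record keys `dob` and `age`, and the function name are all hypothetical.

```python
from datetime import date


def to_birth_year(record, as_of):
    """Map either input representation of age to one storage form.

    Birth year is used as the stored value because, unlike age in
    years, it does not go stale as time passes.
    """
    if "dob" in record:                      # e.g. "1990-05-17"
        return date.fromisoformat(record["dob"]).year
    if "age" in record:                      # age in whole years at as_of
        return as_of.year - int(record["age"])
    return None                              # no usable age information


ingested_on = date(2014, 9, 6)
from_dob = to_birth_year({"dob": "1990-05-17"}, ingested_on)
from_years = to_birth_year({"age": 24}, ingested_on)
```

Deriving a birth year from an age in years can be off by one depending on whether the birthday has passed, which is one reason to also track the source and a confidence value.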
ETL - Storage 
➔Location – Lat/Long, Zip, City, Country 
➔Need to store in the lowest possible granularity 
(Lat/Long) 
➔GPS readings come with accuracy that needs to 
be recorded 
➔Queries are almost always nearness queries, 
not exact matches 
ETL - Storage 
➔Quadtile representation 
➔Use leading bits for tile id, remaining for storing 
accuracy 
➔Transform all location information to such ids 
➔Nearness with Lat/Long distance is a cross-product 
join 
➔With Tiles, we can translate this into equi-joins (of 
course with some loss of accuracy)
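
A minimal sketch of the quadtile idea, under stated assumptions: a plain equirectangular grid, 16 zoom levels, 8 trailing bits for an accuracy bucket and 10 m per bucket. None of these parameters come from the slides, and all function names are hypothetical.

```python
ACCURACY_BITS = 8      # assumed number of trailing bits for accuracy
ACCURACY_STEP_M = 10   # assumed bucket width in metres


def quadtile(lat, lon, zoom=16):
    """Interleave x/y tile indices so nearby points share leading bits."""
    cells = 1 << zoom
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    key = 0
    for bit in range(zoom - 1, -1, -1):
        key = (key << 2) | (((x >> bit) & 1) << 1) | ((y >> bit) & 1)
    return key


def location_id(lat, lon, accuracy_m, zoom=16):
    """Leading bits: tile id. Trailing bits: bucketed GPS accuracy."""
    bucket = min(int(accuracy_m // ACCURACY_STEP_M), (1 << ACCURACY_BITS) - 1)
    return (quadtile(lat, lon, zoom) << ACCURACY_BITS) | bucket


def tile_of(loc_id):
    return loc_id >> ACCURACY_BITS


def accuracy_bucket(loc_id):
    return loc_id & ((1 << ACCURACY_BITS) - 1)


def equi_join(left, right, zoom=16):
    """Pair rows whose points fall in the same tile (a hash equi-join)."""
    index = {}
    for name, lat, lon in right:
        index.setdefault(quadtile(lat, lon, zoom), []).append(name)
    return [(name, other)
            for name, lat, lon in left
            for other in index.get(quadtile(lat, lon, zoom), [])]


users = [("u1", 12.9716, 77.5946)]
places = [("cafe", 12.9717, 77.5947), ("far", 28.6139, 77.2090)]
pairs = equi_join(users, places)
```

The loss of accuracy shows up at tile edges: two points a few metres apart on either side of a boundary get different ids, so a real nearness query would probably also probe neighbouring tiles or a coarser zoom level.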
ETL - Querying 
➔Naive -- Users aware of multiple feeds 
and schemas, query appropriately 
➔Extremely difficult as schemas change, 
new feeds get added 
➔Closely coupled with internal 
representation, not good
ETL - Querying 
➔Having a consistent, published schema 
➔Enables exploration and discovery 
➔Well defined querying interfaces that 
abstract out internal representation 
➔Provide primitives (for example UDFs for 
nearness calculations) for easier querying
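
As an illustration of the kind of primitive a nearness UDF could wrap, the sketch below computes great-circle distance with the haversine formula. It is plain Python rather than an actual Hive UDF, and the names `haversine_km` and `is_near` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_near(lat1, lon1, lat2, lon2, radius_km):
    """Predicate a query could call instead of spelling out the math."""
    return haversine_km(lat1, lon1, lat2, lon2) <= radius_km
```

Exposing a predicate like `is_near` through the query layer keeps consumers away from the internal location encoding.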
Ingestion Server 
● Curation to filter out dubious records 
● Adapters for transformation 
● REST based ingestion server 
– Support multiple compression types 
– Support multiple serialization formats 
– Handle rate-limiting/throttling 
– Bulk/Streaming inputs 
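
The decode and curate steps of such a server might be organized as lookup tables keyed by compression type and serialization format, followed by a filter. The sketch below assumes that layout; the supported types, the record fields (`uid`, `lat`, `lon`) and the plausibility rule are hypothetical, and the REST and throttling parts are left out.

```python
import gzip
import json

# Compression adapters, keyed by a Content-Encoding style name.
DECOMPRESSORS = {
    "identity": lambda body: body,
    "gzip": gzip.decompress,
}


def _parse_json_lines(raw):
    text = raw.decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Serialization adapters, keyed by a Content-Type style name.
DESERIALIZERS = {
    "application/json": lambda raw: json.loads(raw.decode("utf-8")),
    "application/x-ndjson": _parse_json_lines,
}


def is_plausible(record):
    """Curation rule: drop records with no user id or impossible coordinates."""
    lat, lon = record.get("lat"), record.get("lon")
    return (bool(record.get("uid"))
            and isinstance(lat, (int, float))
            and isinstance(lon, (int, float))
            and -90 <= lat <= 90
            and -180 <= lon <= 180)


def ingest(body, content_encoding="identity", content_type="application/json"):
    raw = DECOMPRESSORS[content_encoding](body)
    records = DESERIALIZERS[content_type](raw)
    return [record for record in records if is_plausible(record)]


batch = [
    {"uid": "u1", "lat": 12.97, "lon": 77.59},
    {"uid": "", "lat": 0.0, "lon": 0.0},       # dubious: empty user id
    {"uid": "u3", "lat": 123.0, "lon": 10.0},  # dubious: latitude out of range
]
accepted = ingest(gzip.compress(json.dumps(batch).encode("utf-8")), "gzip")
```

Adding a new compression type or format then means registering one more entry rather than writing a new ingestion flow.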
Storage and Querying 
● Possibly different schema than ingestion 
schema 
● Columnar storage format (Parquet/ORC) 
● Predominantly Hive friendly 
● No direct access to internal storage, access 
only through an HQL-like query layer 
● Export option for other use case (online store)
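
To make the idea of a query layer that hides internal storage concrete, here is a minimal sketch that translates published field names into internal columns before a query is issued. Every table and column name is hypothetical, and the string-built filter is for illustration only; a real layer would bind parameters instead.

```python
# Published field -> internal column. All names here are hypothetical.
PUBLISHED_SCHEMA = {
    "user_id": "uid",
    "age_band": "age_bkt",
    "tile": "loc_tile_id",
}
INTERNAL_TABLE = "audience_dim_v3"  # hypothetical internal table name


def build_query(fields, filters=None):
    """Build an HQL-like SELECT using only published field names."""
    filters = filters or {}
    unknown = [f for f in list(fields) + list(filters)
               if f not in PUBLISHED_SCHEMA]
    if unknown:
        raise KeyError(f"not in published schema: {unknown}")
    select = ", ".join(f"{PUBLISHED_SCHEMA[f]} AS {f}" for f in fields)
    sql = f"SELECT {select} FROM {INTERNAL_TABLE}"
    if filters:
        where = " AND ".join(f"{PUBLISHED_SCHEMA[k]} = {v!r}"
                             for k, v in filters.items())
        sql += f" WHERE {where}"
    return sql


query = build_query(["user_id", "age_band"], {"age_band": "25-34"})
```

Because consumers only see the published names, the internal table or column layout can change by updating the mapping alone.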
Tech Stack 
● Pig for most pipeline tasks 
● Grill for analytics interface 
● Hive as the primary execution engine 
● Tez as the runtime environment 
● ORC/Parquet for the storage format 
Questions
