HADOOP AT TAPAD
March 14, 2013
A Case Study
Mike Moss, VP Engineering
@michaelmoss
What is Tapad?
• Tapad is the first digital advertising solution for real-time mobile audience buying and multi-screen targeting.
• Marketers use Tapad to obtain a unified view of their customers across smartphones, tablets, computers and smart TVs, enabling more relevant and device-specific messaging.
• Tapad bridges devices together to create the Device Graph, which enables Cross Platform Targeting and Analytics.
Device Graph Targeting Capabilities
• Retargeting
- Retarget PC visitors on mobile or tablet
• Location Targeting
- Geo-Fencing
- Airport Targeting
• Audience Targeting
- Economic (Income, Net Worth, Discretionary Income, Home Value, Charitable Contributions, Invested Assets)
- Demographic (Age, Genders Present, Presence of Children, Ethnicity)
• Platform Targeting
- Platform (PC Web, Mobile Web, In-App, Connected TV)
- Device (Android, Android Tablet, Blackberry, Computer, Feature phones, iPad, iPhone, Palm, Symbian, Windows Phone)
- Carrier (AT&T Wireless, MetroPCS, Sprint, T-Mobile, TracFone, Verizon Wireless, etc.)
Data at Tapad
• MySQL
- “CRUD” – Tapad UI and Campaign Manager
• Redis
- Counters – Revenue, Bid Requests, Impressions
• Aerospike
- Device Graph
• Vertica
- Impressions, Clicks, Aggregations – Reporting, ad-hoc queries
Use Case: Predict Available Monthly Impressions for New Campaigns
• How can we predict how many monthly impressions a new advertiser can buy on our platform?
[Diagram: devices D1, D2 and D3 and the Advertiser Home Page. Steps: 1 – Pixel for D1; 2 – Device Graph Propagation; 3 – Bid Request for D2]
(MonthlyUniques[New Advertiser] / MonthlyUniques[Similar Advertiser]) * MonthlyBidRequests[Similar Advertiser]
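The three terms on this slide read as a ratio of monthly uniques scaled by a similar advertiser's monthly bid-request volume. The Scala sketch below works through that arithmetic, assuming this reading is correct. The object name, the function name and the input numbers are hypothetical and do not come from the talk.

```scala
object ImpressionForecast {
  // Scales a similar advertiser's monthly bid requests by the ratio
  // (uniques of the new advertiser) / (uniques of the similar advertiser).
  def forecast(uniquesNew: Long, uniquesSimilar: Long, bidRequestsSimilar: Long): Double = {
    require(uniquesSimilar > 0, "similar advertiser must have at least one unique")
    uniquesNew.toDouble * bidRequestsSimilar.toDouble / uniquesSimilar.toDouble
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical inputs, for illustration only.
    val estimate = forecast(uniquesNew = 200000L, uniquesSimilar = 500000L, bidRequestsSimilar = 30000000L)
    println(f"Estimated monthly bid requests for the new advertiser: $estimate%.0f")
  }
}
```

The estimate is only as good as the choice of "similar" advertiser, which the slide leaves open.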
Bid Requests
• At peak, we get over 150K bid requests/sec
• High Volume/“Low Value” data
• Complex data type (bid_sample_avro.json)
• Not sure of all the ways we would query it
• At a sampling rate of 1/1000, we are capturing 200MB/Hour
• …in other words: Perfect for Hadoop
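The figures above imply some back-of-the-envelope volumes. The sketch below derives them from the slide's three numbers only (150K requests/sec at peak, 1/1000 sampling, 200MB/hour sampled). The object and value names are hypothetical, and decimal units (1 GB = 1000 MB) are an assumption.

```scala
object BidSampleVolume {
  // Figures quoted on the slide.
  val peakRequestsPerSec: Long = 150000L
  val samplingDenominator: Long = 1000L // sampling rate of 1/1000
  val sampledMbPerHour: Double = 200.0

  // Sampled records per hour if traffic stayed at the peak rate all hour (an upper bound).
  val sampledRecordsPerHourAtPeak: Long = peakRequestsPerSec * 3600L / samplingDenominator

  // Volume if every request were stored rather than 1 in 1000 (decimal units).
  val unsampledGbPerHour: Double = sampledMbPerHour * samplingDenominator / 1000.0
  val unsampledTbPerDay: Double = unsampledGbPerHour * 24.0 / 1000.0

  def main(args: Array[String]): Unit = {
    println(s"Sampled records/hour at peak: $sampledRecordsPerHourAtPeak")
    println(f"Unsampled volume: $unsampledGbPerHour%.0f GB/hour, $unsampledTbPerDay%.1f TB/day")
  }
}
```

Storing every request would come to about 200 GB/hour, or roughly 4.8 TB/day, which is why the data is sampled.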
Hadoop Ecosystem
• Hadoop Ecosystem – Heavily fragmented, lots of choices!
• Trends
- “Distro Wars” – Cloudera vs Hortonworks vs MapR
- Real-time, interactive ad-hoc querying – aka “Faster Hive”
  - Apache Drill, Cloudera Impala, Stinger Initiative (YARN, Tez, ORCFile)
  - Many influenced by the Google Dremel paper
  - All are similar and seek to improve on M/R's expensive start-up time and to avoid shuffle/sort disk serialization where possible, as well as unnecessary M/R pipelines
- New languages/frameworks
  - Many more choices than just Pig and Cascading
  - Scalding, Scoobi, Spark, Crunch/Scrunch
  - Many influenced by the Google FlumeJava paper; they seek to avoid the awkwardness of the UDF programming model and experiment with richer typed data models (not just tuples)
Tapad Hadoop POC
• Some SQL, some code
• POC
- Hive
  - Familiar SQL syntax
  - Easy to get started
  - Hue/Beeswax makes SQL on Hadoop easy for non-programmers
- Impala (Cloudera)
  - Most developed of the pack (as of Feb 2013)
- Scalding (Twitter)
  - “A Scala API for Cascading”
  - Algebird
- Cloudera CDH4
• On our Radar
- Hortonworks – Stinger
- Scoobi
• Also tried
- Shark/Spark
Serialization
• Serialization Considerations:
- Parsing efficiency
- Schema evolution
- Compactness
- Complex type support
- Hadoop ecosystem support
• CSV
• JSON
• Avro – Like Protocol Buffers/Thrift, but better:
- Dynamic typing – No code gen required
- Untagged data – Since the schema is included with the data, serialization size is smaller
- No manually-assigned field IDs – Schema migrations are a breeze when both the old and new schemas are present
Compression
• Compression Considerations:
- Splittability
- Speed vs. Compression
- Hadoop ecosystem support
• gzip
• lzo
• Snappy
- “…aims for very high speeds and reasonable compression”
- Integrates seamlessly with Avro
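The "Speed vs. Compression" trade-off can be tried out with the JDK alone. The sketch below uses DEFLATE (the algorithm behind gzip) from `java.util.zip` at its fastest and its strongest level. It does not use Snappy or LZO, which need extra libraries, and it says nothing about splittability. The payload and all names are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{Deflater, DeflaterOutputStream, InflaterInputStream}

object CompressionTradeoff {
  // Compress with DEFLATE at the given level (1 = fastest, 9 = smallest output).
  def deflate(data: Array[Byte], level: Int): Array[Byte] = {
    val sink = new ByteArrayOutputStream()
    val deflater = new Deflater(level)
    val out = new DeflaterOutputStream(sink, deflater)
    try out.write(data) finally { out.close(); deflater.end() }
    sink.toByteArray
  }

  // Decompress, to confirm the round trip is lossless.
  def inflate(data: Array[Byte]): Array[Byte] = {
    val in = new InflaterInputStream(new ByteArrayInputStream(data))
    val sink = new ByteArrayOutputStream()
    val buffer = new Array[Byte](8192)
    try {
      var n = in.read(buffer)
      while (n != -1) { sink.write(buffer, 0, n); n = in.read(buffer) }
    } finally in.close()
    sink.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical, highly repetitive payload standing in for log-like data.
    val payload = ("bid_request," * 10000).getBytes("UTF-8")
    for (level <- Seq(Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION)) {
      val start = System.nanoTime()
      val packed = deflate(payload, level)
      val micros = (System.nanoTime() - start) / 1000
      println(s"level=$level in=${payload.length} out=${packed.length} time=${micros}us")
    }
  }
}
```

Timings from a single run like this are noisy. Real bid data would need repeated measurements before choosing a codec.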
Hive Demo
-- Avro-backed table; the columns come from the Avro schema
CREATE TABLE bids
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='<JSON SCHEMA HERE>');

-- Load a local Avro data file into the table
LOAD DATA LOCAL INPATH 'bids.avro' INTO TABLE `bids`;
Impala Demo
Scalding
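import com.twitter.scalding._
import cascading.tuple.Tuple
import java.util.ArrayList
import scala.collection.JavaConverters._

// Body of a Scalding Job (args is the Job's Args); the import for UnpackedAvroSource is not shown on the slide.
UnpackedAvroSource(args("input"), schema = None)
  .read
  // Emit one 'audienceId for every audience record nested in a bid request
  .flatMapTo('request -> 'audienceId) { record: Tuple =>
    val request: Tuple = record.getObject(0).asInstanceOf[Tuple]
    // Position 6 is read as the device record; Option(...) guards against null
    val device: Option[Tuple] = Option(request.getObject(6).asInstanceOf[Tuple])
    val audienceRecords: Option[ArrayList[Tuple]] = device.flatMap { record =>
      Option(record.getObject(7).asInstanceOf[ArrayList[Tuple]])
    }
    audienceRecords.toSeq.flatMap { records =>
      records.asScala.map(_.getString(0))
    }
  }
  // Count occurrences per audience id, then sort all rows by that count
  .groupBy('audienceId) { _.size('count) }
  .groupAll { _.sortBy('count) }
  .debug
  .write(Tsv(args("output")))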
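The flatMap, group, count and sort steps of the job above can be mirrored on plain in-memory Scala collections, with no Cascading or Avro on the classpath. The case classes below are hypothetical, simplified stand-ins for the nested Avro records, and the tie-break on the id is an addition to make the output deterministic.

```scala
object AudienceCounts {
  // Hypothetical stand-ins for the nested Avro records; Option models nullable fields.
  final case class Device(audienceIds: Option[Seq[String]])
  final case class BidRequest(device: Option[Device])

  // Count how often each audience id appears, sorted by ascending count
  // (ties broken by id so the result is deterministic).
  def countAudiences(requests: Seq[BidRequest]): Seq[(String, Int)] =
    requests
      .flatMap { r => r.device.flatMap(_.audienceIds).toSeq.flatten }
      .groupBy(identity)
      .map { case (id, hits) => (id, hits.size) }
      .toSeq
      .sortBy { case (id, count) => (count, id) }

  def main(args: Array[String]): Unit = {
    val requests = Seq(
      BidRequest(Some(Device(Some(Seq("sports", "travel"))))),
      BidRequest(Some(Device(Some(Seq("sports"))))),
      BidRequest(None),              // request without a device
      BidRequest(Some(Device(None))) // device without audience records
    )
    countAudiences(requests).foreach { case (id, count) => println(s"$id\t$count") }
  }
}
```

Requests with a missing device or missing audience list contribute nothing, matching the `Option` handling in the Scalding version.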
Hardware
• 1 Master Node – 1U
- 2 x Intel Xeon E5-2620 6-Core 2GHz
- 64GB DDR-1600 RAM
- LSI 9240-8i 8-Port RAID Card
- 2 x 1TB Seagate Constellation.2 SAS
• 3 Data Nodes – 2U 12 HD Bays
- 2 x Intel Xeon E5-2620 6-Core 2GHz
- 64GB DDR-1600 RAM
- LSI 9207-8i 8-Port RAID Card
- OS Drive: 100GB Intel DC 3700
- Data Drives: 12 x 3TB Seagate Constellation CS SATA
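The data-node specification above gives a rough storage budget. The sketch below computes raw capacity from the slide's figures and divides by an HDFS replication factor. The talk does not state the replication setting, so 3 (the HDFS default) is an assumption. Space reserved for the OS, temporary data and filesystem overhead is ignored, and all names are hypothetical.

```scala
object ClusterCapacity {
  // Data-node figures from the slide.
  val dataNodes: Int = 3
  val drivesPerNode: Int = 12
  val tbPerDrive: Int = 3

  // Raw disk capacity across all data nodes, in TB.
  val rawTb: Int = dataNodes * drivesPerNode * tbPerDrive

  // Usable space when every block is stored `replication` times.
  def usableTb(replication: Int): Double = {
    require(replication >= 1, "replication factor must be at least 1")
    rawTb.toDouble / replication
  }

  def main(args: Array[String]): Unit = {
    println(s"Raw capacity: $rawTb TB")
    // Assumption: HDFS default replication factor of 3.
    println(f"Usable at replication 3: ${usableTb(3)}%.0f TB")
  }
}
```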
References
Cloudera vs. Hortonworks: http://wikibon.org/wiki/v/The_Hadoop_Wars:_Cloudera_and_Hortonworks%E2%80%99_Death_Match_for_Mindshare
Dremel:
http://research.google.com/pubs/pub36632.html
http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases
FlumeJava: http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
Hadoop Ecosystem (Mar 2013): http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/
Hardware:
http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
Impala: https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions
Spark/Shark: http://www.cs.berkeley.edu/~matei/talks/2012/hadoop_summit_spark.pdf
Stinger: http://hortonworks.com/blog/100x-faster-hive/
SQL on Hadoop: http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/
Tuples vs. Complex Types: http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
Thank You
• Questions?
• Tapad is hiring!
- Data Scientists, Platform/Data/Frontend Engineers
- http://www.tapad.com/careers/
- michael.moss@tapad.com

Hadoop at Tapad

  • 1. HADOOP AT TAPAD March 14, 2013 A Case Study Mike Moss, VP Engineering @michaelmoss
  • 2. What is Tapad?
      - Tapad is the first digital advertising solution for real-time mobile audience buying and multi-screen targeting.
      - Marketers use Tapad to obtain a unified view of their customers across smartphones, tablets, computers and smart TVs, enabling more relevant and device-specific messaging.
      - Tapad bridges devices together to create the Device Graph, which enables Cross Platform Targeting and Analytics.
  • 3. Device Graph Targeting Capabilities
      - Retargeting
          - Retarget PC visitors on mobile or tablet
      - Location Targeting
          - Geo-Fencing
          - Airport Targeting
      - Audience Targeting
          - Economic (Income, Net Worth, Discretionary Income, Home Value, Charitable Contributions, Invested Assets)
          - Demographic (Age, Genders Present, Presence of Children, Ethnicity)
      - Platform Targeting
          - Platform (PC Web, Mobile Web, In-App, Connected TV)
          - Device (Android, Android Tablet, Blackberry, Computer, Feature phones, iPad, iPhone, Palm, Symbian, Windows Phone)
          - Carrier (AT&T Wireless, MetroPCS, Sprint, T-Mobile, TracFone, Verizon Wireless, etc.)
  • 4. Data at Tapad
      - MySQL
          - "CRUD" – Tapad UI and Campaign Manager
      - Redis
          - Counters – Revenue, Bid Requests, Impressions
      - Aerospike
          - Device Graph
      - Vertica
          - Impressions, Clicks, Aggregations – Reporting, ad-hoc queries
  • 5. Use Case: Predict Available Monthly Impressions for New Campaigns
      - How can we predict how many monthly impressions a new advertiser can buy on our platform?
      - [Diagram: devices D1, D2 and D3 and the advertiser home page. 1 – Pixel for D1; 2 – Device Graph propagation; 3 – Bid request for D2]
      - Estimate: (MonthlyUniques_NewAdvertiser / MonthlyUniques_SimilarAdvertiser) * MonthlyBidRequests_SimilarAdvertiser
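The estimate on slide 5 reads as a simple scaling: take a similar advertiser's monthly bid requests and scale them by the ratio of monthly uniques. A minimal sketch under that reading follows; the object name, method name, and the sample numbers in the usage note are hypothetical and do not come from the deck.

```scala
object ImpressionEstimate {
  /** Scales a similar advertiser's monthly bid requests by the ratio of
    * the new advertiser's monthly uniques to the similar advertiser's.
    * All parameter names are illustrative, not Tapad's.
    */
  def estimateMonthlyBidRequests(
      newAdvertiserUniques: Long,
      similarAdvertiserUniques: Long,
      similarAdvertiserBidRequests: Long): Double = {
    require(similarAdvertiserUniques > 0, "similar advertiser needs at least one unique")
    newAdvertiserUniques.toDouble / similarAdvertiserUniques * similarAdvertiserBidRequests
  }
}
```

For example, with made-up inputs of 250,000 new-advertiser uniques, 1,000,000 similar-advertiser uniques and 40,000,000 similar-advertiser bid requests, the call `ImpressionEstimate.estimateMonthlyBidRequests(250000L, 1000000L, 40000000L)` scales the bid requests by one quarter.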
  • 6. Bid Requests
      - At peak, we get over 150K bid requests/sec
      - High Volume/"Low Value" data
      - Complex data type (bid_sample_avro.json)
      - Not sure of all the ways we would query it
      - At a sampling rate of 1/1000, we are capturing 200MB/Hour
      - …in other words: Perfect for Hadoop
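The figures on slide 6 can be turned into rough volume numbers. The sketch below only does arithmetic on the three values quoted on the slide (150K requests/sec at peak, 1/1000 sampling, 200MB/hour captured). It treats the peak rate as an upper bound rather than an average, and the object and value names are hypothetical.

```scala
object BidRequestVolume {
  // Values quoted on the slide
  val peakRequestsPerSec = 150000L // "over 150K bid requests/sec" at peak
  val samplingRate       = 1000L   // 1 request kept out of every 1000
  val sampledMbPerHour   = 200L    // "capturing 200MB/Hour"

  // Derived values (simple arithmetic; peak is an upper bound, not an average)
  val sampledRequestsPerHourAtPeak = peakRequestsPerSec / samplingRate * 3600 // 150/sec for 3600 sec
  val unsampledMbPerHour           = sampledMbPerHour * samplingRate          // volume if nothing were dropped
  val sampledMbPerDay              = sampledMbPerHour * 24
}
```

Under these assumptions, capturing every bid request would mean roughly 200,000MB (about 200GB) per hour, against about 4,800MB per day for the sampled stream.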
  • 7. Hadoop Ecosystem
      - Hadoop Ecosystem – Heavily fragmented, lots of choices!
      - Trends
          - "Distro Wars" – Cloudera vs Hortonworks vs MapR
          - Real-time, interactive ad-hoc querying – aka "Faster Hive"
              - Apache Drill, Cloudera Impala, Stinger Initiative (YARN, Tez, ORCFile)
              - Many influenced by Google's Dremel paper
              - All are similar: they seek to reduce M/R's expensive start-up time and, where possible, to avoid shuffle/sort disk serialization and unnecessary M/R pipelines
          - New languages/frameworks
              - Many more choices than just Pig and Cascading
              - Scalding, Scoobi, Spark, Crunch/Scrunch
              - Many influenced by Google's FlumeJava paper; they seek to avoid the awkwardness of the UDF programming model and experiment with richer typed data models (not just tuples)
  • 8. Tapad Hadoop POC
      - Some SQL, some code
      - POC
          - Hive
              - Familiar SQL syntax
              - Easy to get started
              - Hue/Beeswax makes SQL on Hadoop easy for non-programmers
          - Impala (Cloudera)
              - Most developed of the pack (as of Feb 2013)
          - Scalding (Twitter)
              - "A Scala API for Cascading"
              - Algebird
          - Cloudera CDH4
      - On our Radar
          - Hortonworks – Stinger
          - Scoobi
      - Also tried
          - Shark/Spark
  • 9. Serialization
      - Serialization Considerations:
          - Parsing efficiency
          - Schema evolution
          - Compactness
          - Complex type support
          - Hadoop ecosystem support
      - CSV
      - JSON
      - Avro – Like Protocol Buffers/Thrift, but better:
          - Dynamic typing – No code gen required
          - Untagged data – Since the schema is available when the data is read, less type information is encoded with each value, giving a smaller serialization size
          - No manually-assigned field IDs – Schema migrations are a breeze when both the old and new schemas are present
  • 10. Compression
      - Compression Considerations:
          - Splittability
          - Speed vs. Compression
          - Hadoop ecosystem support
      - gzip
      - lzo
      - Snappy
          - "…aims for very high speeds and reasonable compression"
          - Integrates seamlessly with Avro
  • 11. Hive Demo
        -- Avro-backed table; columns come from the Avro schema in TBLPROPERTIES
        CREATE TABLE bids
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
          STORED AS
            INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
            OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
          TBLPROPERTIES ('avro.schema.literal'='<JSON SCHEMA HERE>');

        -- Load a local Avro container file into the table
        LOAD DATA LOCAL INPATH 'bids.avro' INTO TABLE `bids`;
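  • 13. Scalding
        // Assumes imports for cascading.tuple.Tuple, java.util.ArrayList and scala.collection.JavaConverters._
        UnpackedAvroSource(args("input"), schema = None)
          .read
          // Emit one 'audienceId per audience record found on the request's device
          .flatMapTo('request -> 'audienceId) { record: Tuple =>
            val request: Tuple = record.getObject(0).asInstanceOf[Tuple]
            // Position 6 of the request is read as the device record; Option guards against null
            val device: Option[Tuple] = Option(request.getObject(6).asInstanceOf[Tuple])
            // Position 7 of the device is read as the list of audience records; it may also be null
            val audienceRecords: Option[ArrayList[Tuple]] = device.flatMap { record =>
              Option(record.getObject(7).asInstanceOf[ArrayList[Tuple]])
            }
            audienceRecords.toSeq.flatMap { records =>
              records.asScala.map(_.getString(0))
            }
          }
          // Count occurrences of each audience ID, then sort all rows by count
          .groupBy('audienceId) { _.size('count) }
          .groupAll { _.sortBy('count) }
          .debug
          .write(Tsv(args("output")))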
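The Scalding job on slide 13 extracts audience IDs from each bid request, counts them, and sorts by count. The same logic can be sketched with plain Scala collections, which may help when reading the pipeline. The case classes below are hypothetical, simplified stand-ins for the Avro records (the real schema is not shown in the deck), and the tie-break on ID is an addition to make the output order deterministic.

```scala
object AudienceCounts {
  // Hypothetical stand-ins for the nested Avro records; Option models nullable fields
  final case class Audience(id: String)
  final case class Device(audiences: Option[List[Audience]])
  final case class BidRequest(device: Option[Device])

  /** Audience IDs attached to a request's device; empty if device or list is missing. */
  def audienceIds(request: BidRequest): Seq[String] =
    request.device.flatMap(_.audiences).toSeq.flatMap(_.map(_.id))

  /** Counts requests per audience ID, sorted ascending by count, then by ID. */
  def countByAudience(requests: Seq[BidRequest]): Seq[(String, Int)] =
    requests
      .flatMap(audienceIds)
      .groupBy(identity)
      .map { case (id, ids) => id -> ids.size }
      .toSeq
      .sortBy { case (id, count) => (count, id) }
}
```

The `flatMap` over `Option` values mirrors the null handling in the Scalding version: a request with no device, or a device with no audience list, simply contributes no IDs.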
  • 14. Hardware
      - 1 Master Node – 1U
          - 2 x Intel Xeon E5-2620 6-Core 2GHz
          - 64GB DDR-1600 RAM
          - LSI 9240-8i 8-Port RAID Card
          - 2 x 1TB Seagate Constellation.2 SAS
      - 3 Data Nodes – 2U 12 HD Bays
          - 2 x Intel Xeon E5-2620 6-Core 2GHz
          - 64GB DDR-1600 RAM
          - LSI 9207-8i 8-Port RAID Card
          - OS Drive: 100GB Intel DC 3700
          - Data Drives: 12 x 3TB Seagate Constellation CS SATA
  • 15. References
      - Cloudera vs. Hortonworks: http://wikibon.org/wiki/v/The_Hadoop_Wars:_Cloudera_and_Hortonworks%E2%80%99_Death_Match_for_Mindshare
      - Dremel: http://research.google.com/pubs/pub36632.html
        http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases
      - FlumeJava: http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
      - Hadoop Ecosystem (Mar 2013): http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/
      - Hardware: http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/
        http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
      - Impala: https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions
      - Spark/Shark: http://www.cs.berkeley.edu/~matei/talks/2012/hadoop_summit_spark.pdf
      - Stinger: http://hortonworks.com/blog/100x-faster-hive/
      - SQL on Hadoop: http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/
      - Tuples vs. Complex Types: http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
  • 16. Thank You
      - Questions?
      - Tapad is hiring!
          - Data Scientists, Platform/Data/Frontend Engineers
          - http://www.tapad.com/careers/
          - michael.moss@tapad.com