2. What is Tapad?
Tapad is the first digital advertising solution for real-time mobile audience buying and
multi-screen targeting.
Marketers use Tapad to obtain a unified view of their customers across smartphones,
tablets, computers and smart TVs, enabling more relevant and device-specific messaging.
Tapad links devices together to create the Device Graph, which enables Cross-Platform
Targeting and Analytics.
3. Device Graph Targeting Capabilities
Retargeting
- Retarget PC visitors on mobile or tablet
Location Targeting
- Geo-Fencing
- Airport Targeting
Audience Targeting
- Economic (Income, Net Worth, Discretionary Income, Home Value, Charitable
Contributions, Invested Assets)
- Demographic (Age, Genders Present, Presence of Children, Ethnicity)
Platform Targeting
- Platform (PC Web, Mobile Web, In-App, Connected TV)
- Device (Android, Android Tablet, Blackberry, Computer, Feature phones, iPad, iPhone,
Palm, Symbian, Windows Phone)
- Carrier (AT&T Wireless, MetroPCS, Sprint, T-Mobile, TracFone, Verizon Wireless, etc.)
4. Data at Tapad
• MySQL
• “CRUD” – Tapad UI and Campaign Manager
• Redis
• Counters – Revenue, Bid Requests, Impressions
• Aerospike
• Device Graph
• Vertica
• Impressions, Clicks, Aggregations - Reporting, ad-hoc queries
5. Use Case: Predict Available Monthly Impressions for New Campaigns
How can we predict how many monthly impressions a new advertiser can buy on our
platform?
[Diagram: devices D1, D2 and D3, and the advertiser home page]
1 – Pixel for D1
2 – Device Graph Propagation
3 – Bid Request for D2
(MonthlyUniques[NewAdvertiser] / MonthlyUniques[SimilarAdvertiser])
  * MonthlyBidRequests[SimilarAdvertiser]
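The ratio on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Tapad's implementation: the function name and all the numbers are hypothetical, and it assumes the formula reads as (uniques of the new advertiser / uniques of a similar advertiser) times the similar advertiser's monthly bid requests.

```python
def estimate_monthly_bid_requests(new_uniques, similar_uniques, similar_bid_requests):
    """Scale a similar advertiser's monthly bid requests by the ratio of monthly uniques."""
    if similar_uniques <= 0:
        raise ValueError("similar advertiser needs at least one monthly unique")
    return new_uniques / similar_uniques * similar_bid_requests


# Hypothetical inputs, for illustration only.
estimate = estimate_monthly_bid_requests(
    new_uniques=50_000,
    similar_uniques=200_000,
    similar_bid_requests=8_000_000,
)
print(f"Estimated monthly bid requests: {estimate:,.0f}")
```

The result is probably best read as a ceiling on buyable impressions, since not every bid request turns into a won impression.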
6. Bid Requests
At peak, we get over 150K bid requests/sec
High Volume/“Low Value” data
Complex data type (bid_sample_avro.json)
Not sure of all the ways we would query it
At a sampling rate of 1/1000, we are capturing 200MB/Hour
…in other words: Perfect for Hadoop
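A quick back-of-the-envelope check of what those figures imply, as a sketch. It assumes decimal units (1 GB = 1000 MB) and treats 200MB/Hour as a steady rate; the slide does not say whether that figure is a peak or an average.

```python
PEAK_BID_REQUESTS_PER_SEC = 150_000  # "over 150K bid requests/sec" at peak
SAMPLE_ONE_IN = 1_000                # sampling rate of 1/1000
SAMPLED_MB_PER_HOUR = 200            # captured volume after sampling

# Records kept per second at peak traffic.
sampled_per_sec_at_peak = PEAK_BID_REQUESTS_PER_SEC / SAMPLE_ONE_IN

# Volume the sample accumulates per day, and what the unsampled stream would be.
sampled_gb_per_day = SAMPLED_MB_PER_HOUR * 24 / 1000
unsampled_gb_per_hour = SAMPLED_MB_PER_HOUR * SAMPLE_ONE_IN / 1000

print(f"Sampled records/sec at peak: {sampled_per_sec_at_peak:.0f}")
print(f"Sampled data per day:        {sampled_gb_per_day:.1f} GB")
print(f"Unsampled data per hour:     {unsampled_gb_per_hour:.0f} GB")
```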
7. Hadoop Ecosystem
Hadoop Ecosystem – Heavily fragmented, lots of choices!
Trends
- “Distro Wars” – Cloudera vs Hortonworks vs MapR
- Real-time, interactive ad-hoc querying – aka “Faster Hive”
- Apache Drill, Cloudera Impala, Stinger Initiative (YARN, Tez, ORCFile)
- Many influenced by Google Dremel paper
- All are similar and seek to reduce M/R's expensive start-up time, avoid shuffle/sort
disk serialization where possible, and eliminate unnecessary M/R pipelines.
- New languages/frameworks
- Many more choices than just Pig and Cascading
- Scalding, Scoobi, Spark, Crunch/Scrunch
- Many influenced by the Google FlumeJava paper; they seek to avoid the awkwardness of the
UDF programming model and experiment with richer typed data models (not just tuples)
8. Tapad Hadoop POC
Some SQL, some code
POC
- Hive
- Familiar SQL syntax
- Easy to get started
- Hue/Beeswax makes SQL on Hadoop easy for non-programmers
- Impala (Cloudera)
- Most developed of the pack (as of Feb 2013)
- Scalding (Twitter)
- “A Scala API for Cascading”
- Algebird
- Cloudera CDH4
On our Radar
- Hortonworks – Stinger
- Scoobi
Also tried
- Shark/Spark
9. Serialization
Serialization Considerations:
- Parsing efficiency
- Schema evolution
- Compactness
- Complex type support
- Hadoop ecosystem support
CSV
JSON
Avro – Like Protocol Buffers/Thrift, but better:
- Dynamic typing – No code gen required
- Untagged data – Because the schema is stored with the data, values carry no field
tags, so the serialized size is smaller
- No manually-assigned field IDs – Fields are matched by name, so schema migrations are a
breeze when both the old and new schemas are available
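To make the schema-migration point concrete, here is a simplified sketch of how a reader schema can be applied to data written with an older schema. It does not use the Avro library; it only imitates the by-name matching and default-filling part of Avro's schema resolution (the real rules also cover type promotion, aliases and unions). The `resolve` helper and the field names are hypothetical.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema, matching by field name."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]      # field present in both schemas
        elif "default" in field:
            resolved[name] = field["default"]  # new field: fall back to its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                            # writer-only fields are dropped


# Hypothetical old and new schemas for a bid record.
old_schema = {"type": "record", "name": "Bid", "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "price", "type": "double"},
]}
new_schema = {"type": "record", "name": "Bid", "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "platform", "type": "string", "default": "unknown"},
]}

old_record = {"device_id": "d1", "price": 0.42}
migrated = resolve(old_record, old_schema, new_schema)
print(migrated)
```

No field IDs appear anywhere: the removed `price` field is skipped and the added `platform` field takes its default, which is why both schemas need to be on hand at read time.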
10. Compression
Compression Considerations:
- Splittability
- Speed vs. Compression
- Hadoop ecosystem support
gzip
lzo
Snappy
- “…aims for very high speeds and reasonable compression”
- Integrates seamlessly with Avro
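The speed vs. compression trade-off can be observed with a small experiment. Snappy and LZO are not in the Python standard library, so this sketch uses `zlib` (the DEFLATE algorithm behind gzip) at different levels as a stand-in; the sample records are made up, and timings will vary by machine.

```python
import json
import time
import zlib

# Made-up, repetitive records standing in for sampled bid requests.
records = [
    {"device_id": f"d{i % 500}", "platform": "Mobile Web", "price": i % 97}
    for i in range(20_000)
]
payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")


def measure(level):
    """Compress the payload at the given zlib level; return (compressed size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    if zlib.decompress(compressed) != payload:
        raise RuntimeError("round trip failed")
    return len(compressed), elapsed


results = {level: measure(level) for level in (1, 6, 9)}
for level, (size, elapsed) in results.items():
    print(f"level {level}: {size / len(payload):.1%} of original, {elapsed * 1000:.1f} ms")
```

Lower levels typically finish faster and compress less, which is the end of the spectrum Snappy targets. Splittability is a separate concern: a plain gzip file cannot be split across mappers, whereas container formats such as Avro data files compress block by block.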
11. Hive Demo
CREATE TABLE bids
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='<JSON SCHEMA HERE>');
LOAD DATA LOCAL INPATH 'bids.avro' INTO TABLE `bids`;