Hadoop at Tapad

March 14, 2013
A Case Study
Mike Moss, VP Engineering
@michaelmoss

What is Tapad?
• Tapad is the first digital advertising solution for real-time mobile audience buying and multi-screen targeting.
• Marketers use Tapad to obtain a unified view of their customers across smartphones, tablets, computers and smart TVs, enabling more relevant and device-specific messaging.
• Tapad bridges devices together to create the Device Graph, which enables cross-platform targeting and analytics.

Device Graph Targeting Capabilities
• Retargeting
- Retarget PC visitors on mobile or tablet
• Location Targeting
- Geo-Fencing
- Airport Targeting
• Audience Targeting
- Economic (Income, Net Worth, Discretionary Income, Home Value, Charitable Contributions, Invested Assets)
- Demographic (Age, Genders Present, Presence of Children, Ethnicity)
• Platform Targeting
- Platform (PC Web, Mobile Web, In-App, Connected TV)
- Device (Android, Android Tablet, BlackBerry, Computer, Feature phones, iPad, iPhone, Palm, Symbian, Windows Phone)
- Carrier (AT&T Wireless, MetroPCS, Sprint, T-Mobile, TracFone, Verizon Wireless, etc.)

Data at Tapad
• MySQL
- “CRUD” – Tapad UI and Campaign Manager
• Redis
- Counters – Revenue, Bid Requests, Impressions
• Aerospike
- Device Graph
• Vertica
- Impressions, Clicks, Aggregations – Reporting, ad-hoc queries

Use Case: Predict Available Monthly Impressions for New Campaigns
• How can we predict how many monthly impressions a new advertiser can buy on our platform?
[Diagram: devices D1, D2 and D3; Advertiser Home Page]
1 – Pixel for D1
2 – Device Graph Propagation
3 – Bid Request for D2

Estimate:
(MonthlyUniques_NewAdvertiser / MonthlyUniques_SimilarAdvertiser) * MonthlyBidRequests_SimilarAdvertiser
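
The ratio on this slide (the new advertiser's monthly uniques over a similar advertiser's monthly uniques, times the similar advertiser's monthly bid requests) can be sketched in a few lines of Scala. This is only an illustration: the object name, the method name and the numbers in the usage note are hypothetical and do not come from the talk.

```scala
object ImpressionForecast {
  // Scales a similar advertiser's monthly bid requests by the ratio of
  // unique audiences: (newUniques / similarUniques) * similarBidRequests.
  def estimateMonthlyBidRequests(
      newUniques: Long,
      similarUniques: Long,
      similarBidRequests: Long): Double = {
    require(similarUniques > 0, "similar advertiser needs at least one unique")
    newUniques.toDouble / similarUniques * similarBidRequests
  }
}
```

For example, a new advertiser with 500,000 monthly uniques compared against a similar advertiser with 1,000,000 uniques and 40,000,000 monthly bid requests would be estimated at 20,000,000 bid requests per month.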

Bid Requests
• At peak, we get over 150K bid requests/sec
• High Volume/“Low Value” data
• Complex data type (bid_sample_avro.json)
• Not sure of all the ways we would query it
• At a sampling rate of 1/1000, we are capturing 200 MB/hour
• …in other words: Perfect for Hadoop
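
To put the sampling figures in perspective, the arithmetic below extrapolates the 200 MB/hour captured at a 1/1000 sampling rate. It is a back-of-the-envelope sketch: the 30-day month and decimal units (1 GB = 1000 MB) are assumptions, and the object and member names are hypothetical.

```scala
object BidVolume {
  val sampledMbPerHour: Long = 200L // from the slide
  val sampleOneIn: Long = 1000L     // 1/1000 sampling rate, from the slide

  // Sampled data landing in Hadoop per day.
  def sampledMbPerDay: Long = sampledMbPerHour * 24

  // Sampled data per month; the 30-day default is an assumption.
  def sampledMbPerMonth(days: Int = 30): Long = sampledMbPerDay * days

  // What one hour would weigh if every bid request were kept.
  def unsampledMbPerHour: Long = sampledMbPerHour * sampleOneIn
}
```

Under these assumptions the sample grows by 4,800 MB (4.8 GB) per day and 144,000 MB (144 GB) per month, while the unsampled stream would be roughly 200,000 MB (200 GB) per hour.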

Hadoop Ecosystem
• Hadoop Ecosystem – Heavily fragmented, lots of choices!
• Trends
- “Distro Wars” – Cloudera vs Hortonworks vs MapR
- Real-time, interactive ad-hoc querying – aka “Faster Hive”
  - Apache Drill, Cloudera Impala, Stinger Initiative (YARN, Tez, ORCFile)
  - Many influenced by Google's Dremel paper
  - All are similar and seek to improve on M/R's expensive start-up time and to avoid shuffle/sort disk serialization and unnecessary M/R pipelines where possible
- New languages/frameworks
  - Many more choices than just Pig and Cascading
  - Scalding, Scoobi, Spark, Crunch/Scrunch
  - Many are influenced by Google's FlumeJava paper, seek to avoid the awkwardness of the UDF programming model, and experiment with richer typed data models (not just tuples)

Tapad Hadoop POC
• Some SQL, some code
• POC
- Hive
  - Familiar SQL syntax
  - Easy to get started
  - Hue/Beeswax makes SQL on Hadoop easy for non-programmers
- Impala (Cloudera)
  - Most developed of the pack (as of Feb 2013)
- Scalding (Twitter)
  - “A Scala API for Cascading”
  - Algebird
- Cloudera CDH4
• On our Radar
- Hortonworks – Stinger
- Scoobi
• Also tried
- Shark/Spark

Serialization
• Serialization Considerations:
- Parsing efficiency
- Schema evolution
- Compactness
- Complex type support
- Hadoop ecosystem support
• CSV
• JSON
• Avro – Like Protocol Buffers/Thrift, but better:
- Dynamic typing – No code gen required
- Untagged data – Since the schema is stored with the data, the serialized size is smaller
- No manually-assigned field IDs – Schema migrations are a breeze when both the old and new schemas are present

Compression
• Compression Considerations:
- Splittability
- Speed vs. Compression
- Hadoop ecosystem support
• gzip
• LZO
• Snappy
- “…aims for very high speeds and reasonable compression”
- Integrates seamlessly with Avro
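
The “Speed vs. Compression” trade-off can be tried out with the JDK alone. The sketch below uses java.util.zip.Deflater (DEFLATE, the algorithm behind gzip) at a caller-chosen level, because Snappy and LZO are not part of the JDK; it is a stand-in for the idea, not the codec setup used in the talk. The object name and the sample payload are hypothetical.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, Inflater}

object DeflateLevels {
  // Compresses input with DEFLATE at the given level
  // (Deflater.BEST_SPEED = 1 ... Deflater.BEST_COMPRESSION = 9).
  def compress(input: Array[Byte], level: Int): Array[Byte] = {
    val deflater = new Deflater(level)
    deflater.setInput(input)
    deflater.finish()
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    while (!deflater.finished()) {
      val n = deflater.deflate(buf)
      out.write(buf, 0, n)
    }
    deflater.end()
    out.toByteArray
  }

  // Restores the original bytes; the caller supplies the original length.
  def decompress(compressed: Array[Byte], originalLength: Int): Array[Byte] = {
    val inflater = new Inflater()
    inflater.setInput(compressed)
    val result = new Array[Byte](originalLength)
    var offset = 0
    while (offset < originalLength && !inflater.finished()) {
      offset += inflater.inflate(result, offset, originalLength - offset)
    }
    inflater.end()
    result
  }

  // Hypothetical, highly repetitive payload standing in for bid request data.
  val sample: Array[Byte] =
    (("""{"device":"iPhone","carrier":"Sprint"}""" + "\n") * 2000).getBytes("UTF-8")
}
```

Calling compress(sample, Deflater.BEST_SPEED) and compress(sample, Deflater.BEST_COMPRESSION) and comparing sizes and timings gives a feel for the knob; Snappy sits further toward the speed end than either setting.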

Hive Demo
-- Create a Hive table backed by Avro container files
CREATE TABLE bids
  ROW FORMAT
    SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.literal'='<JSON SCHEMA HERE>');

-- Load a local Avro file into the table
LOAD DATA LOCAL INPATH 'bids.avro' INTO TABLE `bids`;

Scalding
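import java.util.ArrayList
import scala.collection.JavaConverters._
import cascading.tuple.Tuple
import com.twitter.scalding._
// UnpackedAvroSource comes from the Scalding Avro module; its package depends on the version in use.

// Counts bid requests per audience ID. The class name is illustrative; the slide shows only the pipeline.
class AudienceCountJob(args: Args) extends Job(args) {
  UnpackedAvroSource(args("input"), schema = None)
    .read
    // Positional lookups (0, 6, 7) index fields of the Avro record; the schema is not shown on the slide.
    .flatMapTo('request -> 'audienceId) { record: Tuple =>
      val request: Tuple = record.getObject(0).asInstanceOf[Tuple]
      // The device and its audience list may be null, so both are wrapped in Option.
      val device: Option[Tuple] = Option(request.getObject(6).asInstanceOf[Tuple])
      val audienceRecords: Option[ArrayList[Tuple]] = device.flatMap { d =>
        Option(d.getObject(7).asInstanceOf[ArrayList[Tuple]])
      }
      // Emit one audience ID per audience record.
      audienceRecords.toSeq.flatMap { records =>
        records.asScala.map(_.getString(0))
      }
    }
    .groupBy('audienceId) { _.size('count) }
    .groupAll { _.sortBy('count) }
    .debug
    .write(Tsv(args("output")))
}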
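
The same counting logic can be followed without a cluster by running it on plain Scala collections. This is a sketch under assumptions: the case classes BidRequest, Device and Audience are hypothetical stand-ins for the Avro records, and the tie-break on audience ID is added only to make the output order deterministic.

```scala
object AudienceCounts {
  // Hypothetical stand-ins for the nested Avro records; Option models nullable fields.
  final case class Audience(id: String)
  final case class Device(audiences: Option[Seq[Audience]])
  final case class BidRequest(device: Option[Device])

  // Flattens every request to its audience IDs, counts occurrences per ID,
  // and sorts ascending by count (then by ID to break ties).
  def countByAudience(requests: Seq[BidRequest]): Seq[(String, Int)] = {
    val ids = for {
      request   <- requests
      device    <- request.device.toSeq
      audiences <- device.audiences.toSeq
      audience  <- audiences
    } yield audience.id

    ids
      .groupBy(identity)
      .map { case (id, hits) => id -> hits.size }
      .toSeq
      .sortBy { case (id, count) => (count, id) }
  }
}
```

Requests with no device, or a device with no audience list, contribute nothing to the counts, which mirrors the Option handling in the Scalding job.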

Hardware
• 1 Master Node – 1U
- 2 x Intel Xeon E5-2620 6-Core 2GHz
- 64GB DDR-1600 RAM
- LSI 9240-8i 8-Port RAID Card
- 2 x 1TB Seagate Constellation.2 SAS
• 3 Data Nodes – 2U 12 HD Bays
- 2 x Intel Xeon E5-2620 6-Core 2GHz
- 64GB DDR-1600 RAM
- LSI 9207-8i 8-Port RAID Card
- OS Drive: 100GB Intel DC 3700
- Data Drives: 12 x 3TB Seagate Constellation CS SATA
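
A rough capacity estimate follows from the data node specs. The sketch below assumes HDFS's default replication factor of 3, which the slides do not state, and ignores space reserved for the OS, temporary files and non-HDFS use; the object and member names are hypothetical.

```scala
object ClusterCapacity {
  val dataNodes = 3         // from the slide
  val drivesPerNode = 12    // from the slide
  val tbPerDrive = 3        // from the slide
  val replicationFactor = 3 // assumption: HDFS default, not stated in the talk

  // Total disk across all data nodes.
  val rawTb: Int = dataNodes * drivesPerNode * tbPerDrive

  // Space left for unique data once every block is stored replicationFactor times.
  val usableTb: Int = rawTb / replicationFactor
}
```

Under these assumptions the cluster has 108 TB of raw disk and about 36 TB of replicated capacity.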

References
Cloudera vs. Hortonworks: http://wikibon.org/wiki/v/The_Hadoop_Wars:_Cloudera_and_Hortonworks%E2%80%99_Death_Match_for_Mindshare
Dremel:
http://research.google.com/pubs/pub36632.html
http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases
FlumeJava: http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
Hadoop Ecosystem (Mar 2013): http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/
Hardware:
http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
Impala: https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions
Spark/Shark: http://www.cs.berkeley.edu/~matei/talks/2012/hadoop_summit_spark.pdf
Stinger: http://hortonworks.com/blog/100x-faster-hive/
SQL on Hadoop: http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/
Tuples vs. Complex Types: http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading

Thank You
• Questions?
• Tapad is hiring!
- Data Scientists, Platform/Data/Frontend Engineers
- http://www.tapad.com/careers/
- michael.moss@tapad.com
