Big Data at Tube: Events to Insights to Action
- 1. Big Data at Tube
(Events → Insights → Actions)
27th April 2016
@John Trenkle (Chief Scientist)
@Murtaza Doctor (Director of Engineering, RTB)
- 2. ©2016 TubeMogul Inc. All rights reserved.
Outline
• Where do we fit?
• What do we do?
• Life of a video Ad
• RTB Architecture
• Events Architecture
• ML Perspective: Transactional -> User-Oriented
• Data -> Models
• Models -> Action
- 5.
Scale:
An enterprise software company for digital branding
● Processed over 12.6 Trillion Ad Auctions in 2015
● Serve over 55 billion auctions per day
● Served over 3 Billion Ad Impressions on linear TV via our PTV solution
● Process bids in < 50 ms
● Serve bid responses in < 80 ms (includes network round-trip)
● Serve 5 PB of monthly video traffic
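As a rough back-of-envelope check, the daily and monthly figures above can be converted into average per-second rates. This is only a sketch: it assumes traffic is spread evenly over time (real traffic peaks will be higher), a 30-day month, and decimal petabytes. The variable names are illustrative.

```python
# Back-of-envelope arithmetic from the slide's figures.
# Assumption: traffic is spread evenly over the day (real peaks will be higher).

AUCTIONS_PER_DAY = 55_000_000_000   # "over 55 billion auctions per day"
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

avg_auctions_per_second = AUCTIONS_PER_DAY / SECONDS_PER_DAY

# Monthly video traffic expressed as an average bit rate.
# Assumptions: 1 PB = 10**15 bytes and a 30-day month.
MONTHLY_BYTES = 5 * 10**15          # "5 PB of monthly video traffic"
avg_gbit_per_second = MONTHLY_BYTES * 8 / (30 * SECONDS_PER_DAY) / 10**9

print(f"average auctions/second: {avg_auctions_per_second:,.0f}")
print(f"average video egress:    {avg_gbit_per_second:.1f} Gbit/s")
```

Under these assumptions the averages come out to roughly 637 thousand auctions per second and about 15 Gbit/s of video traffic.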
- 7.
Technical Overview
Bidding Layer
Ad Serving
- High Volumes
- Low Latency
- Small Packets
- Large Data Sets
- Low Latency
- Fast Processing
- Large Caches
Low Latency User Database for User Targeting and Frequency Capping
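Frequency capping limits how often a given user is shown a given campaign within a time window, so the bidder has to look up recent impressions on every bid. The sketch below illustrates the idea with an in-memory dictionary standing in for the low-latency user database. The class, its method names, and the sliding-window policy are assumptions for illustration, not the system described in the slides.

```python
import time
from collections import defaultdict, deque


class FrequencyCapper:
    """Toy in-memory stand-in for a low-latency user database (hypothetical API)."""

    def __init__(self, max_impressions, window_seconds):
        self.max_impressions = max_impressions
        self.window_seconds = window_seconds
        # (user_id, campaign_id) -> timestamps of recent impressions, oldest first
        self._seen = defaultdict(deque)

    def _prune(self, key, now):
        """Drop impressions that have fallen out of the sliding window."""
        stamps = self._seen[key]
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        return stamps

    def may_bid(self, user_id, campaign_id, now=None):
        """Return True if the user is still under the cap for this campaign."""
        now = time.time() if now is None else now
        return len(self._prune((user_id, campaign_id), now)) < self.max_impressions

    def record_impression(self, user_id, campaign_id, now=None):
        """Call on a win event so later bid decisions see the impression."""
        now = time.time() if now is None else now
        self._seen[(user_id, campaign_id)].append(now)


capper = FrequencyCapper(max_impressions=2, window_seconds=3600)
capper.record_impression("user-1", "campaign-A", now=0)
capper.record_impression("user-1", "campaign-A", now=10)
capped_now = capper.may_bid("user-1", "campaign-A", now=20)       # two impressions in the window
allowed_later = capper.may_bid("user-1", "campaign-A", now=3605)  # first impression has expired
print(capped_now, allowed_later)
```

A production store would need to be shared across bidders and answer within the bid latency budget; the dictionary here only shows the lookup and update pattern.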
- 8.
Events Architecture:
● Auctions (Bids + Non Bids)
● Win Events (Impressions)
● Columnar format (ORC)
● Data Pipeline?
● Bad data?
● Scaling challenges
● Multiple downstream consumers
- 10.
Events Architecture: Takeaways
● Simplify and Unify
● Focus on Data Validation at each step
● Automated recovery
● Leverage the messaging system for status or completion
● Metrics & Measurement for SLA
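One way to picture validation at each step, combined with metrics and a completion message, is a pipeline stage that splits every batch into valid and quarantined events and reports counts when it finishes. The sketch below assumes a hypothetical event schema (`event_type`, `user_id`, `timestamp`) and a hypothetical completion-message shape; none of these names come from the slides.

```python
from collections import Counter

REQUIRED_FIELDS = {"event_type", "user_id", "timestamp"}   # hypothetical schema
KNOWN_EVENT_TYPES = {"bid", "no_bid", "win"}               # hypothetical values


def validate_event(event):
    """Return None if the event is valid, otherwise a short reason string."""
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return "missing:" + ",".join(sorted(missing))
    if event["event_type"] not in KNOWN_EVENT_TYPES:
        return "unknown_event_type"
    if not isinstance(event["timestamp"], (int, float)) or event["timestamp"] <= 0:
        return "bad_timestamp"
    return None


def run_stage(stage_name, events):
    """Split a batch into good and quarantined events and build a status message."""
    good, quarantined, metrics = [], [], Counter()
    for event in events:
        reason = validate_event(event)
        if reason is None:
            good.append(event)
            metrics["valid"] += 1
        else:
            quarantined.append((reason, event))
            metrics[reason] += 1
    # This dict is what a stage could publish to the messaging system on completion.
    completion = {"stage": stage_name, "status": "complete", "metrics": dict(metrics)}
    return good, quarantined, completion


batch = [
    {"event_type": "win", "user_id": "u1", "timestamp": 1461715200},
    {"event_type": "bid", "user_id": "u2"},                        # no timestamp
    {"event_type": "click?", "user_id": "u3", "timestamp": 1461715201},
]
good, quarantined, completion = run_stage("ingest", batch)
print(completion)
```

Keeping the rejected events together with the reason makes automated recovery easier, since a later job can replay the quarantine once the upstream problem is fixed.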
- 11.
Machine-Learning as a Consumer
• Audience Modeling begets user-oriented data
• Pivot RTB / Analytics sources for model-building
• Many sources of Truth that need to be integrated
• Ad Interaction
• Characterize Users with robust signature (UU-Code) rather than just an item list
• Facilitate rapid prototyping and model-building
• Maintain enriched information for exploratory analysis and visualization
• Insights
• Actionable Intel
- 12.
Ad Calls to User-Traces in Hive (on path to NoSQL)
Hive
RTB Ad Calls
RTB Digest
User Activity
NoSQL
RTB Ad Calls
User Activity
Elastic Search
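The move from transactional ad calls to user traces is essentially a pivot: records keyed by auction are regrouped by user and ordered in time. A minimal in-memory sketch of that pivot follows. The field names (`user_id`, `ts`, `site`) are assumptions, and a real job would run as a Hive or similar distributed aggregation rather than a Python loop.

```python
from collections import defaultdict

# Hypothetical ad-call records: one row per auction (transaction-oriented).
ad_calls = [
    {"user_id": "u1", "ts": 30, "site": "news.example"},
    {"user_id": "u2", "ts": 10, "site": "video.example"},
    {"user_id": "u1", "ts": 5, "site": "sports.example"},
]


def build_user_traces(calls):
    """Regroup ad calls by user and sort each user's activity by timestamp."""
    traces = defaultdict(list)
    for call in calls:
        traces[call["user_id"]].append(call)
    for events in traces.values():
        events.sort(key=lambda call: call["ts"])
    return dict(traces)


user_traces = build_user_traces(ad_calls)
for user_id, events in sorted(user_traces.items()):
    print(user_id, [event["site"] for event in events])
```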
- 13.
Token Embedding Models and Spark
http://deepdist.com/
Ref: http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
- 14.
Cascading for Signatures
1. JOIN on tm_client
2. Filter average weight per verticals < 0.5
Daily Users Activities Prefixed
Daily UUCode Creation Process
Daily UUCodes
TM Client Daily Activity3
Get Truth Users By LAL Segment
Daily Truth Users for all LAL segment
Centroid Creation Process
LAL Landmarks
Segment Creation Process
User Membership Unfiltered
UUCode Model
TM Daily Converters
Convs LAL segments from Mario
User Membership
Attach SourceID Process
Daily UUCodes with Source ID
TMClientID SourceID Lookup
Aggregated UUCode Creation Process
UU Code
TM Client Digest3
Create SourceID Lookup Process
Wormhole Process
Segment Filter Process
~650GB
UDB Team Persistent Users Table
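The diagram names a centroid creation step that produces look-alike (LAL) landmarks, followed by a user membership step, but it does not say how either is computed. The sketch below shows one common way such a flow can work, under loud assumptions: each user signature is a plain numeric vector, the landmark is the mean of the truth users' vectors, and membership is decided by cosine similarity against a threshold. All names, data, and the threshold are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors (assumed landmark definition)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def members(user_vectors, landmark, threshold):
    """Users whose signature is at least `threshold` similar to the landmark."""
    return sorted(
        user_id
        for user_id, vector in user_vectors.items()
        if cosine(vector, landmark) >= threshold
    )


# Hypothetical signatures for the truth users of one LAL segment.
truth_users = {"u1": [1.0, 0.0, 0.2], "u2": [0.8, 0.1, 0.0]}
landmark = centroid(list(truth_users.values()))

# Hypothetical candidates scored against the landmark.
candidates = {"u3": [0.9, 0.0, 0.1], "u4": [0.0, 1.0, 0.0]}
segment_members = members(candidates, landmark, threshold=0.9)
print(landmark, segment_members)
```

Here `u3` points in nearly the same direction as the landmark and is admitted, while `u4` is almost orthogonal to it and is left out.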
- 15.
Large-Scale Predictive Model Building
Get Truth Users, signature
Data Warehouse
Of truth users
Training Data Creation
Training Data for segments
Ground Truth
For each segment, perform training
Check performance, log in mysql for tracking purposes.
Model/weights file for each segment
Aggregate and Convert to UUCode
UU Code Model
3 months aggregation
Segment Information Dashboard UI
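The loop in this diagram (train one model per segment, check performance, log the result, keep a weights file per segment) can be sketched as follows. The slides do not name the learning algorithm, so a small logistic regression trained by batch gradient descent is used purely as a stand-in, and an in-memory SQLite table stands in for the MySQL tracking table. Segment names, features, and the table schema are all hypothetical.

```python
import math
import sqlite3


def predict_proba(weights, x):
    """Logistic prediction; the last entry of `weights` is the bias term."""
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit weights (bias last) with plain batch gradient descent."""
    dim = len(rows[0])
    weights = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[dim] += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def accuracy(weights, rows, labels):
    """Share of rows classified correctly at a 0.5 probability cut-off."""
    hits = sum((predict_proba(weights, x) >= 0.5) == bool(y) for x, y in zip(rows, labels))
    return hits / len(rows)


# Hypothetical training data: segment -> (signature vectors, ground-truth labels).
training_data = {
    "segment_a": ([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]], [1, 1, 0, 0]),
    "segment_b": ([[0.0, 1.0], [0.2, 0.8], [0.8, 0.2], [1.0, 0.0]], [1, 1, 0, 0]),
}

db = sqlite3.connect(":memory:")  # stand-in for the MySQL tracking database
db.execute("CREATE TABLE model_runs (segment TEXT, accuracy REAL)")

models = {}
for segment, (rows, labels) in training_data.items():
    weights = train_logistic(rows, labels)
    models[segment] = weights  # a real job would write a weights file per segment
    # Accuracy is measured on the training rows only to keep the sketch short.
    db.execute("INSERT INTO model_runs VALUES (?, ?)", (segment, accuracy(weights, rows, labels)))
db.commit()

logged_runs = db.execute("SELECT segment, accuracy FROM model_runs ORDER BY segment").fetchall()
print(logged_runs)
```

A real pipeline would evaluate on held-out users rather than the training rows, and would train segments in parallel, since each segment's model is independent of the others.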
- 16.
Partners that have Contributed to Our Ecosystem
• Qubole
• Long-time partners
• Great for Ad Hoc queries and scheduled ETL
• Dynamic Scaling
• Snowflake
• Data Warehouse – facilitates Fraud Analysis
• SpotInst
• Cost effective Spot Instances in EMR
• Robust provisioning
• Dynamic Scaling
• Driven
• Monitor, optimize and debug Hadoop flows
- 17.
Since Hive has been our primary datastore for a while…
• Tips and tricks
• ORC
• MAPJOIN
• Sorted, Bucketed JOINs
• TRANSFORM
• HAVING
• Hadoop Streaming
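TRANSFORM and Hadoop Streaming share the same contract: rows are handed to an external script on standard input, one per line with tab-separated columns, and the script writes result rows to standard output in the same format. The sketch below shows a script body that counts distinct sites per user. The column layout, the function name `run`, and the assumption that rows arrive grouped by user are all illustrative.

```python
import io


def run(stdin, stdout):
    """Read tab-separated (user_id, site) rows and emit (user_id, distinct_site_count).

    Assumes rows for the same user_id arrive contiguously, for example because
    the query distributes and sorts by user_id before the script runs.
    """
    current_user, sites = None, set()
    for line in stdin:
        user_id, site = line.rstrip("\n").split("\t")
        if user_id != current_user:
            if current_user is not None:
                stdout.write(f"{current_user}\t{len(sites)}\n")
            current_user, sites = user_id, set()
        sites.add(site)
    if current_user is not None:
        stdout.write(f"{current_user}\t{len(sites)}\n")


# Local demo with in-memory streams; a deployed script would instead pass
# sys.stdin and sys.stdout to run().
demo_in = io.StringIO("u1\ta.com\nu1\tb.com\nu1\ta.com\nu2\tc.com\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
result = demo_out.getvalue()
print(result, end="")
```

In Hive, a script like this would typically be registered with `ADD FILE` and invoked through `SELECT TRANSFORM(user_id, site) USING 'python count_sites.py' AS (user_id, site_count)`, where `count_sites.py` is a hypothetical file name. The same script could serve as a Hadoop Streaming reducer, since the framework also delivers keys in sorted order.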
- 18.
Models → Action
• Optimization
• Surrogate measures of engagement: Clicks, Completions, Conversions
• Audience Building for Targeting
• Demographic
• Behavioral
• Fraud Detection
• Cross Device Synching
• Profiling / Data Mining / Actionable Intel