This document summarizes a talk on online advertising and on fitting large-scale models to advertising data. It covers batch and online algorithms for logistic regression, including parallelizing existing batch algorithms and stochastic gradient descent, as well as the alternating direction method of multipliers (ADMM) and follow-the-proximally-regularized-leader (FTPRL) for fitting models to large datasets across multiple machines. It also gives examples of how major companies such as LinkedIn, Facebook, and Criteo deploy hybrid online-batch algorithms at scale.
2. Outline
● Introduction to Online Advertising
● Handling Real Data
– Data Engineering
– Model Matrix
– Enhancing the Computation Speed of R
● Fitting Model to Large Scale Data
– Batch Algorithms: Parallelizing Existing Algorithms
– Online Algorithms: SGD, FTPRL, and Learning Rate Schemes
● Display Advertising Challenge
8. Why Is Online Advertising Growing?
● Wide reach
● Target oriented
● Quick conversion
● Highly informative
● Cost-effective
● Easy to use
● Measurable
"Half the money I spend on advertising is wasted; the trouble is I don't know which half." (attributed to John Wanamaker)
9. How do we measure online ads?
● User behavior on the internet is trackable.
– We know who watches the ad.
– We know who buys the product.
● We collect data for measurement.
11. Performance-based advertising
● Pricing Model
– Cost-Per-Mille (CPM)
– Cost-Per-Click (CPC)
– Cost-Per-Action (CPA) or Cost-Per-Order (CPO)
12. To Improve Profit
● Display the ad with the highest Click-Through Rate (CTR) × CPC, or Conversion Rate (CVR) × CPO (see the example below)
● Estimating the probability of a click (or conversion) is the central problem
– Rule Based
– Statistical Modeling (Machine Learning)
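For example (hypothetical numbers): an ad with a 1% CTR and a $0.50 CPC earns an expected $0.005 per impression, while an ad with a 0.2% CTR and a $2.00 CPC earns only $0.004, so the first ad should be displayed.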
13. System
[System diagram: the website sends an ad request to the recommendation service, which delivers the ad back to the website; a log server records the responses and feeds both batch and online model fitting.]
14. Rule Based
● Let the advertiser select the target group
15. Statistical Modeling
● We log the ad displays and collect the responses
● Features
– Ad
– Channel
– User
16. Features of Ad
● Ad type
– Text
– Figure
– Video
● Ad Content
– Fashion
– Health
– Game
19. Real Features
Weinan Zhang, Shuai Yuan, Jun Wang, and Xuehua Shen. Real-Time Bidding Benchmarking with iPinYou Dataset.
20. Know-How vs. Know-Why
● We usually do not study why an ad has a high CTR
● A small improvement in accuracy implies a large improvement in profit
● Predictive Analysis
21. Data
● School
– Static
– Cleaned
– Public
● Commercial
– Dynamic
– Erroneous
– Private
23. Data Engineering with R
http://wush978.github.io/REngineering/
● Automation of R Jobs
– Convert R script to command line application
– Learn modern tools such as Jenkins
● Connections between multiple machines
– Learn ssh
● Logging
– Linux tools: bash redirection, tee
– R package: logging
● R Error Handling
– try, tryCatch (see the sketch below)
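A minimal sketch of a robust R job that combines the logging package with tryCatch (the file names are illustrative):
library(logging)
basicConfig()
addHandler(writeToFile, file = "job.log")        # log to console and to a file
result <- tryCatch({
  data <- read.csv("input.csv")                  # illustrative input path
  loginfo("read %d rows", nrow(data))
  sum(data$value)
}, error = function(e) {
  logerror("job failed: %s", conditionMessage(e))
  NA
})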
24. Characteristics of the Data
● Rare events
● A large number of categorical features
– Numerical features are binned
● Features are highly correlated
● Some features occur frequently, some occur rarely
25. Common Statistical Models for CTR
● Logistic Regression
● Gradient Boosted Regression Trees
– Check xgboost
26. Logistic Regression
$P(\text{Click} \mid x) = \frac{1}{1 + e^{-w^T x}} = \sigma(w^T x)$
● Linear relationship with the features
– Fast prediction
– (Relatively) fast fitting
● Usually fit with L2 regularization, as in the sketch below
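A minimal sketch, assuming the glmnet package, of fitting an L2-regularized logistic regression on simulated sparse data (all sizes and values are illustrative):
library(Matrix)
library(glmnet)
set.seed(1)
n <- 1000; p <- 50
X <- rsparsematrix(n, p, density = 0.1)                 # sparse feature matrix
w <- rnorm(p)
y <- rbinom(n, 1, 1 / (1 + exp(-as.numeric(X %*% w))))  # simulated clicks
fit  <- glmnet(X, y, family = "binomial", alpha = 0)    # alpha = 0: pure L2 penalty
prob <- predict(fit, newx = X, s = 0.01, type = "response")  # P(click | x)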
27. How large is the data?
● Instances: 10^9
● Binary features: 10^5
28. Subsampling
● Sampling is useful for:
– Data exploration
– Code testing
● Sampling might harm the accuracy (and the profit)
– Rare events
– Some features occur frequently and some occur rarely
● So far, we have not subsampled our data
29. Sampling
● Olivier Chapelle et al. Simple and Scalable Response Prediction for Display Advertising.
32. Dense Matrix
● 10^9 instances
● 10^5 binary features
● 10^14 elements for model matrix
● Size: 4 * 10^14 bytes
– 400 TB
● In-memory access is about 10^3 times faster than disk access
33. R and Large Scale Data
● R cannot handle data at this scale out of the box
● R consumes a lot of memory
36. Sparse Matrix
● The number of non-zero entries can be estimated from the number of categorical variables
– $m \sim 10^9$ instances, $n \sim 10^5$ features, $k \sim 10^1 \times 10^9$ non-zero entries
– Dense matrix: $4 \times 10^{14}$ bytes
– List (triplet) format: $12 \times 10^9$ bytes
– Compressed (CSC/CSR): $12 \times 10^9$ bytes, or $8 \times 10^9 + 4 \times 10^5$ bytes
37. Sparse Matrix
● A sparse matrix is useful for (see the sketch below):
– Large amounts of categorical data
– Text Analysis
– Tag Analysis
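A minimal sketch, using the Matrix package that ships with R, of building a sparse model matrix from categorical features (the data frame is illustrative):
library(Matrix)
df <- data.frame(
  ad_type = factor(c("text", "figure", "video", "text")),
  channel = factor(c("A", "B", "A", "C"))
)
X <- sparse.model.matrix(~ ad_type + channel, data = df)  # dgCMatrix (CSC storage)
print(X)                                                  # one dummy column per level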
40. Advanced tip: package Rcpp
● C/C++ uses memory more efficiently
● Rcpp provides an easy interface between R and C/C++
#include <Rcpp.h>
using namespace Rcpp;
// One possible completion of the sketch: X^T v for a sparse
// dgCMatrix m (CSC format), reading the sparse slots directly.
// [[Rcpp::export]]
NumericVector XTv(S4 m, NumericVector v) {
  IntegerVector dim = m.slot("Dim"), p = m.slot("p"), i = m.slot("i");
  NumericVector x = m.slot("x"), retval(dim[1]);
  for (int col = 0; col < dim[1]; col++)      // loop over columns
    for (int k = p[col]; k < p[col + 1]; k++) // non-zeros in this column
      retval[col] += x[k] * v[i[k]];
  return retval;
}
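Hypothetical usage, assuming the block above is saved as XTv.cpp:
library(Rcpp); library(Matrix)
sourceCpp("XTv.cpp")                        # compile and export XTv to R
m <- rsparsematrix(1000, 50, density = 0.1)
v <- rnorm(1000)
all.equal(XTv(m, v), as.numeric(crossprod(m, v)))  # should be TRUE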
41. Two Approaches to Fitting Logistic Regression to Large-Scale Data
● Batch algorithms
– Optimize the log-likelihood globally
● Online algorithms
– Optimize the loss function instance by instance
42. Batch Algorithm
Negative log-likelihood:
$f(w \mid (x_1, y_1), \dots, (x_m, y_m)) = \sum_{t=1}^{m} -y_t \log(\sigma(w^T x_t)) - (1 - y_t)\log(1 - \sigma(w^T x_t))$
Gradient descent:
$w_{t+1} = w_t - \eta \nabla f(w_t)$
Each update requires scanning all the data.
43. Parallelizing an Existing Batch Algorithm
Row-wise partition:
$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} v = \begin{pmatrix} X_1 v \\ X_2 v \end{pmatrix}, \qquad (v_1 \; v_2) \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = v_1 X_1 + v_2 X_2$
● We can split the data by instances across several machines
● The matrix-vector multiplications can then be parallelized
44. Frameworks for Parallelization
● Hadoop
– Slow for iterative algorithms
– Fault tolerant
– Good for many machines
● MPI
– Fast for iterative algorithms if the data fits in memory
– No fault tolerance
– Good for several machines
46. R Package: pbdMPI
● Easy to install (on Ubuntu)
– sudo apt-get install openmpi-bin openmpi-common
libopenmpi-dev
– install.packages("pbdMPI")
● Easy to develop (compared to Rmpi)
47. R Package: pbdMPI
library(pbdMPI)
init()                                         # start the MPI communicator
.rank <- comm.rank()                           # rank of this process (0, 1, ...)
filename <- sprintf("%d.csv", .rank)           # each rank reads its own partition
data <- read.csv(filename)
target <- reduce(sum(data$value), op = "sum")  # combine partial sums on rank 0
finalize()
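Run it with one R process per data partition, e.g. (script name illustrative):
mpiexec -np 2 Rscript reduce.R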
48. Parallelizing an Algorithm with pbdMPI
● Implement the functions required by the optimizer with pbdMPI (see the sketch below)
– optim requires f and g (the gradient of f)
– nlminb requires f, g, and H (the Hessian of f)
– tron requires f, g, and Hs (H multiplied by a given vector s)
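A minimal sketch, with illustrative file names, of distributed f and g for optim: each rank evaluates its local partition and allreduce sums the pieces across ranks.
library(pbdMPI)
init()
X <- as.matrix(read.csv(sprintf("X%d.csv", comm.rank())))  # local partition
y <- read.csv(sprintf("y%d.csv", comm.rank()))$y
f <- function(w) {                       # global negative log-likelihood
  z <- as.numeric(X %*% w)
  allreduce(sum(log(1 + exp(z)) - y * z), op = "sum")
}
g <- function(w) {                       # global gradient X^T (sigma(z) - y)
  p <- 1 / (1 + exp(-as.numeric(X %*% w)))
  allreduce(as.numeric(crossprod(X, p - y)), op = "sum")
}
fit <- optim(rep(0, ncol(X)), f, g, method = "L-BFGS-B")
finalize()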
49. Some Tips on Optimization
● Take care with the stopping criteria
– A relative threshold might be enough
● Save the coefficients during the iterations and print the values of f and g, using the <<- operator (see the sketch below)
– You can then stop the iteration at any time
– Monitor the convergence
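A minimal sketch of that trick on a toy objective (all names and values are illustrative):
best_w <- NULL
f <- function(w) {
  val <- sum((w - 1)^2)                  # toy objective
  best_w <<- w                           # save the current coefficients globally
  cat(sprintf("f = %.6f\n", val))        # monitor the convergence
  val
}
res <- optim(c(0, 0), f, method = "BFGS",
             control = list(reltol = 1e-6))  # relative stopping threshold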
51. The LinkedIn Way
Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013.
● Too much data to fit on a single machine
– Billions of observations, millions of features
● A naive approach
– Partition the data and run logistic regression on each partition
– Take the mean of the learned coefficients
– Problem: not guaranteed to converge to the single-machine model!
● Alternating Direction Method of Multipliers (ADMM)
– Boyd et al. 2011 (based on earlier work from the 70s)
52. ADMM
Each node k holds its own data and coefficients $w_k$; a consensus constraint ties them to the global $w$:

$\min \sum_{k=1}^{K} f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 \quad \text{subject to } w_k = w \;\; \forall k$

The updates are

$w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + \frac{\rho}{2}\|w_k - w^t + u_k^t\|_2^2$

$w^{t+1} = \arg\min_{w} \frac{\lambda}{2}\|w\|_2^2 + \frac{\rho}{2}\sum_{k=1}^{K}\|w_k^{t+1} - w + u_k^t\|_2^2$

$u_k^{t+1} = u_k^t + w_k^{t+1} - w^{t+1}$
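A minimal single-machine simulation of these updates for L2-regularized logistic regression (K, rho, lambda, and the data are illustrative; each local subproblem is solved with optim, and in production the k-loop would run on separate nodes):
set.seed(1)
K <- 4; n <- 5; rho <- 1; lambda <- 1
Xs <- lapply(1:K, function(k) matrix(rnorm(200 * n), 200, n))
ys <- lapply(Xs, function(X) rbinom(nrow(X), 1, 1 / (1 + exp(-X %*% rep(0.5, n)))))
nll <- function(w, X, y) sum(log(1 + exp(X %*% w)) - y * (X %*% w))
w <- rep(0, n); wk <- matrix(0, K, n); u <- matrix(0, K, n)
for (t in 1:50) {
  for (k in 1:K)                                    # local w_k updates
    wk[k, ] <- optim(wk[k, ], function(v)
      nll(v, Xs[[k]], ys[[k]]) + rho / 2 * sum((v - w + u[k, ])^2),
      method = "BFGS")$par
  w <- rho * colSums(wk + u) / (lambda + K * rho)   # closed-form global update
  u <- u + wk - matrix(w, K, n, byrow = TRUE)       # dual variable updates
}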
55. Our Remarks on ADMM
● ADMM saves communication between the nodes
● In our environment, the communication overhead is affordable
– So ADMM did not improve the performance of our system
56. Online Algorithm
Stochastic Gradient Descent (SGD):

$f(w \mid y_t, x_t) = -y_t \log(\sigma(w^T x_t)) - (1 - y_t)\log(1 - \sigma(w^T x_t))$
$w_{t+1} = w_t - \eta \nabla f(w_t \mid y_t, x_t)$

● Choose an initial value and a learning rate
● Randomly shuffle the instances in the training set
● Scan the data and update the coefficients one instance at a time
– Repeat until an approximate minimum is obtained (see the sketch below)
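A minimal sketch of one SGD pass in plain R on simulated data (all sizes and the learning rate are illustrative):
set.seed(1)
m <- 10000; n <- 10
X <- matrix(rnorm(m * n), m, n)
w_true <- rnorm(n)
y <- rbinom(m, 1, 1 / (1 + exp(-as.numeric(X %*% w_true))))
w <- rep(0, n); eta <- 0.1
for (t in sample(m)) {                   # one shuffled pass over the data
  p <- 1 / (1 + exp(-sum(X[t, ] * w)))   # predicted click probability
  w <- w - eta * (p - y[t]) * X[t, ]     # gradient step on one instance
}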
57. SGD to Follow The Proximal Regularized Leader
H. Brendan McMahan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. AISTATS 2011.

The SGD step can be rewritten as a minimization:

$w_{t+1} = w_t - \eta_t \nabla f(w_t \mid y_t, x_t) = \arg\min_w \nabla f(w_t \mid y_t, x_t)^T w + \frac{1}{2\eta_t}(w - w_t)^T (w - w_t)$

Let $g_t = \nabla f(w_t \mid y_t, x_t)$ and $g_{1:t} = \sum_{i=1}^{t} g_i$. FTPRL instead solves

$w_{t+1} = \arg\min_w g_{1:t}^T w + t\lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\sum_{i=1}^{t}\|w - w_i\|_2^2$
58. Regret of SGD
H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010.

$\text{Regret} := \sum_{t=1}^{T} f_t(w_t) - \min_w \sum_{t=1}^{T} f_t(w)$

A global learning rate achieves the regret bound $O(DM\sqrt{T})$, where $D$ is the $L_2$ diameter of the feasible set and $M$ is the $L_2$ bound on the gradients $g$.
59. Regret of SGD
H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010.

The per-coordinate learning rate

$\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^2}}$

achieves the regret bound $O(\sqrt{T}\, n^{1-\gamma/2})$, where $n$ is the dimension of $w$, assuming $w \in [-0.5, 0.5]^n$ (so $D = \sqrt{n}$) and $P(x_{t,i} = 1) \sim i^{-\gamma}$ for some $\gamma \in [1, 2)$. A plain-R sketch of this update follows.
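A minimal sketch of the per-coordinate rate on simulated data (sizes, alpha, and beta are illustrative):
set.seed(1)
m <- 5000; n <- 20
X <- matrix(rbinom(m * n, 1, 0.1), m, n)   # sparse binary features
y <- rbinom(m, 1, 0.5)
w <- rep(0, n); G <- rep(0, n)             # G accumulates squared gradients
alpha <- 0.1; beta <- 1
for (t in sample(m)) {
  p <- 1 / (1 + exp(-sum(X[t, ] * w)))
  g <- (p - y[t]) * X[t, ]                 # per-instance gradient
  G <- G + g^2
  w <- w - alpha / (beta + sqrt(G)) * g    # eta_{t,i} from the formula above
}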
60. Comparison of Learning Rate Schemes
Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.
61. Google KDD 2013: FTPRL
H. Brendan McMahan et al. Ad Click Prediction: a View from the Trenches. KDD 2013.
62. Some Remarks on FTPRL
● FTPRL is a general optimization framework
– We used it successfully to fit neural networks
● The per-coordinate learning rate greatly improves the convergence on our data
– SGD also works with a per-coordinate learning rate
● The "proximal" part decreases accuracy but introduces sparsity
63. Implementation of FTPRL in R
● I am not aware of any other implementation of online optimization in R
● The algorithm is simple: just write it as a for loop (see the sketch below)
● The overhead of a loop is small in C/C++ compared to R
● I implemented the algorithm in https://github.com/wush978/BridgewellML/tree/r-pkg
– Call for users
– Contact me if you want to try it
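A minimal sketch of that loop: per-coordinate FTPRL for logistic regression on simulated data, following the closed-form update in McMahan et al. 2013 (the sizes and the alpha, beta, l1, l2 values are illustrative):
set.seed(1)
m <- 5000; d <- 20
X <- matrix(rbinom(m * d, 1, 0.1), m, d)    # sparse binary features
y <- rbinom(m, 1, 0.3)
z <- rep(0, d); n2 <- rep(0, d)             # FTPRL state: z and sum of g^2
alpha <- 0.1; beta <- 1; l1 <- 1; l2 <- 1
for (t in sample(m)) {
  w <- ifelse(abs(z) <= l1, 0,              # closed-form sparse weights
              -(z - sign(z) * l1) / ((beta + sqrt(n2)) / alpha + l2))
  p <- 1 / (1 + exp(-sum(X[t, ] * w)))
  g <- (p - y[t]) * X[t, ]
  sigma <- (sqrt(n2 + g^2) - sqrt(n2)) / alpha
  z <- z + g - sigma * w                    # accumulate the adjusted gradients
  n2 <- n2 + g^2
}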
65. Batch vs. Online
Olivier Chapelle et al. Simple and Scalable Response Prediction for Display Advertising.
● Batch algorithms
– Optimize the likelihood function to a high accuracy once they are in a good neighborhood of the optimal solution
– Quite slow to reach that neighborhood
– Straightforward to generalize to a distributed environment
● Online algorithms (mini-batch)
– Optimize the likelihood to a rough precision quite fast
– Need only a handful of passes over the data
– Tricky to parallelize
66. Criteo Inc.: Hybrid of Online and Batch
● Each node makes one online pass over its local data using adaptive-gradient updates
● The local weights are then averaged and used as the initial value for L-BFGS
67. Facebook
Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.
● Decision Tree (Batch) for Feature Transforms
● Logistic Regression (Online)
73. Display Advertising Challenge
● https://www.kaggle.com/c/criteo-display-ad-challenge
● 7 × 10^7 instances
● 13 integer features and 26 categorical features with about 3 × 10^7 levels
● We placed 9th out of 718 teams
– We fit a neural network (2-layer logistic regression) to the data with FTPRL and dropout
74. Dropout in SGD
Geoffrey E. Hinton et al. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. CoRR 2012.
75. Tools for Large-Scale Model Fitting
● Almost all of the top-10 competitors implemented the algorithms themselves
– There is no dominant tool for large-scale model fitting
● The winner used only 20 GB of memory. See https://github.com/guestwalk/kaggle-2014-criteo
● For a single machine, there are some good machine learning libraries
– LIBLINEAR for linear models (a student from the lab behind it placed 1st)
– xgboost for gradient boosted regression trees (its author placed 12th)
– Vowpal Wabbit