This document summarizes a talk on online advertising and on fitting large-scale models to advertising data. It covers batch and online algorithms for logistic regression, including parallelizing existing batch algorithms and stochastic gradient descent, as well as the alternating direction method of multipliers (ADMM) and follow-the-proximally-regularized-leader (FTPRL) for fitting models to large datasets across multiple machines. It also gives examples of how major companies such as LinkedIn, Facebook, and Criteo deploy hybrid online-batch algorithms at scale.
2. Outline
● Introduction to Online Advertising
● Handling Real Data
– Data Engineering
– Model Matrix
– Enhancing the Computation Speed of R
● Fitting Model to Large Scale Data
– Batch Algorithms: Parallelizing Existing Algorithms
– Online Algorithms: SGD, FTPRL, and Learning Rate Schemes
● Display Advertising Challenge
8. Why Is Online Advertising Growing?
● Wide reach
● Target oriented
● Quick conversion
● Highly informative
● Cost-effective
● Easy to use
● Measurable
"Half the money I spend on advertising is wasted; the trouble is I don't know which half." (attributed to John Wanamaker)
9. How do we measure online ads?
● User behavior on the internet is trackable.
– We know who watches the ad.
– We know who buys the product.
● We collect data for measurement.
11. Performance-based advertising
● Pricing Model
– Cost-Per-Mille (CPM)
– Cost-Per-Click (CPC)
– Cost-Per-Action (CPA) or Cost-Per-Order (CPO)
12. To Improve Profit
● Display the ad with the highest Click-Through Rate (CTR) × CPC, or Conversion Rate (CVR) × CPO (see the example below)
● Estimating the probability of a click (or conversion) is the central problem
– Rule Based
– Statistical Modeling (Machine Learning)
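For example (hypothetical numbers): an ad with a 1% CTR and a $0.50 CPC earns an expected $0.005 per impression, while an ad with a 0.2% CTR and a $2.00 CPC earns only $0.004, so the first ad should be displayed.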
13. System
[System diagram: the website sends an ad request to the recommendation service, which delivers the ad back to the website; a log server records the responses and feeds both batch and online model fitting.]
14. Rule Based
● Let the advertiser select the target group
15. Statistical Modeling
● We log the ad displays and collect the responses
● Features
– Ad
– Channel
– User
16. Features of Ad
● Ad type
– Text
– Figure
– Video
● Ad Content
– Fashion
– Health
– Game
19. Real Features
Weinan Zhang, Shuai Yuan, Jun Wang, and Xuehua Shen. Real-Time Bidding Benchmarking with iPinYou Dataset.
20. Know-How vs. Know-Why
● We usually do not study why an ad has a high CTR
● A small improvement in accuracy implies a large improvement in profit
● Predictive Analysis
21. Data
● School
– Static
– Cleaned
– Public
● Commercial
– Dynamic
– Erroneous
– Private
23. Data Engineering with R
http://wush978.github.io/REngineering/
● Automation of R Jobs
– Convert R script to command line application
– Learn modern tools such as Jenkins
● Connections between multiple machines
– Learn ssh
● Logging
– Linux tools: bash redirection, tee
– R package: logging
● R Error Handling
– try, tryCatch (see the sketch below)
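A minimal sketch of a robust R job that combines the logging package with tryCatch (the file names are illustrative):
library(logging)
basicConfig()
addHandler(writeToFile, file = "job.log")        # log to console and to a file
result <- tryCatch({
  data <- read.csv("input.csv")                  # illustrative input path
  loginfo("read %d rows", nrow(data))
  sum(data$value)
}, error = function(e) {
  logerror("job failed: %s", conditionMessage(e))
  NA
})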
24. Characteristics of the Data
● Rare events
● A large number of categorical features
– Numerical features are binned
● Features are highly correlated
● Some features occur frequently, some occur rarely
25. Common Statistical Models for CTR
● Logistic Regression
● Gradient Boosted Regression Trees
– Check xgboost
26. Logistic Regression
$P(\text{Click} \mid x) = \frac{1}{1 + e^{-w^T x}} = \sigma(w^T x)$
● Linear relationship with the features
– Fast prediction
– (Relatively) fast fitting
● Usually fit with L2 regularization, as in the sketch below
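A minimal sketch, assuming the glmnet package, of fitting an L2-regularized logistic regression on simulated sparse data (all sizes and values are illustrative):
library(Matrix)
library(glmnet)
set.seed(1)
n <- 1000; p <- 50
X <- rsparsematrix(n, p, density = 0.1)                 # sparse feature matrix
w <- rnorm(p)
y <- rbinom(n, 1, 1 / (1 + exp(-as.numeric(X %*% w))))  # simulated clicks
fit  <- glmnet(X, y, family = "binomial", alpha = 0)    # alpha = 0: pure L2 penalty
prob <- predict(fit, newx = X, s = 0.01, type = "response")  # P(click | x)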
27. How large is the data?
● Instances: 10^9
● Binary features: 10^5
28. Subsampling
● Sampling is useful for:
– Data exploration
– Code testing
● Sampling might harm the accuracy (and the profit)
– Rare events
– Some features occur frequently and some occur rarely
● So far, we have not subsampled our data
29. Sampling
● Olivier Chapelle et al. Simple and Scalable Response Prediction for Display Advertising.
32. Dense Matrix
● 10^9 instances
● 10^5 binary features
● 10^14 elements for model matrix
● Size: 4 * 10^14 bytes
– 400 TB
● In-memory access is about 10^3 times faster than disk access
33. R and Large Scale Data
● R cannot handle data at this scale out of the box
● R consumes a lot of memory
36. Sparse Matrix
● The number of non-zero entries can be estimated from the number of categorical variables
– $m \sim 10^9$ instances, $n \sim 10^5$ features, $k \sim 10^1 \times 10^9$ non-zero entries
– Dense matrix: $4 \times 10^{14}$ bytes
– List (triplet) format: $12 \times 10^9$ bytes
– Compressed (CSC/CSR): $12 \times 10^9$ bytes, or $8 \times 10^9 + 4 \times 10^5$ bytes
37. Sparse Matrix
● A sparse matrix is useful for (see the sketch below):
– Large amounts of categorical data
– Text Analysis
– Tag Analysis
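A minimal sketch, using the Matrix package that ships with R, of building a sparse model matrix from categorical features (the data frame is illustrative):
library(Matrix)
df <- data.frame(
  ad_type = factor(c("text", "figure", "video", "text")),
  channel = factor(c("A", "B", "A", "C"))
)
X <- sparse.model.matrix(~ ad_type + channel, data = df)  # dgCMatrix (CSC storage)
print(X)                                                  # one dummy column per level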
40. Advanced tip: package Rcpp
● C/C++ uses memory more efficiently
● Rcpp provides an easy interface between R and C/C++
#include <Rcpp.h>
using namespace Rcpp;
// One possible completion of the sketch: X^T v for a sparse
// dgCMatrix m (CSC format), reading the sparse slots directly.
// [[Rcpp::export]]
NumericVector XTv(S4 m, NumericVector v) {
  IntegerVector dim = m.slot("Dim"), p = m.slot("p"), i = m.slot("i");
  NumericVector x = m.slot("x"), retval(dim[1]);
  for (int col = 0; col < dim[1]; col++)      // loop over columns
    for (int k = p[col]; k < p[col + 1]; k++) // non-zeros in this column
      retval[col] += x[k] * v[i[k]];
  return retval;
}
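Hypothetical usage, assuming the block above is saved as XTv.cpp:
library(Rcpp); library(Matrix)
sourceCpp("XTv.cpp")                        # compile and export XTv to R
m <- rsparsematrix(1000, 50, density = 0.1)
v <- rnorm(1000)
all.equal(XTv(m, v), as.numeric(crossprod(m, v)))  # should be TRUE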
41. Two Approaches to Fitting Logistic Regression to Large-Scale Data
● Batch algorithms
– Optimize the log-likelihood globally
● Online algorithms
– Optimize the loss function instance by instance
42. Batch Algorithm
Negative log-likelihood:
$f(w \mid (x_1, y_1), \dots, (x_m, y_m)) = \sum_{t=1}^{m} -y_t \log(\sigma(w^T x_t)) - (1 - y_t)\log(1 - \sigma(w^T x_t))$
Gradient descent:
$w_{t+1} = w_t - \eta \nabla f(w_t)$
Each update requires scanning all the data.
43. Parallelizing an Existing Batch Algorithm
Row-wise partition:
$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} v = \begin{pmatrix} X_1 v \\ X_2 v \end{pmatrix}, \qquad (v_1 \; v_2) \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = v_1 X_1 + v_2 X_2$
● We can split the data by instances across several machines
● The matrix-vector multiplications can then be parallelized
44. Frameworks for Parallelization
● Hadoop
– Slow for iterative algorithms
– Fault tolerant
– Good for many machines
● MPI
– Fast for iterative algorithms if the data fits in memory
– No fault tolerance
– Good for several machines
46. R Package: pbdMPI
● Easy to install (on Ubuntu)
– sudo apt-get install openmpi-bin openmpi-common
libopenmpi-dev
– install.packages("pbdMPI")
● Easy to develop (compared to Rmpi)
47. R Package: pbdMPI
library(pbdMPI)
init()                                         # start the MPI communicator
.rank <- comm.rank()                           # rank of this process (0, 1, ...)
filename <- sprintf("%d.csv", .rank)           # each rank reads its own partition
data <- read.csv(filename)
target <- reduce(sum(data$value), op = "sum")  # combine partial sums on rank 0
finalize()
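Run it with one R process per data partition, e.g. (script name illustrative):
mpiexec -np 2 Rscript reduce.R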
48. Parallelizing an Algorithm with pbdMPI
● Implement the functions required by the optimizer with pbdMPI (see the sketch below)
– optim requires f and g (the gradient of f)
– nlminb requires f, g, and H (the Hessian of f)
– tron requires f, g, and Hs (H multiplied by a given vector s)
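A minimal sketch, with illustrative file names, of distributed f and g for optim: each rank evaluates its local partition and allreduce sums the pieces across ranks.
library(pbdMPI)
init()
X <- as.matrix(read.csv(sprintf("X%d.csv", comm.rank())))  # local partition
y <- read.csv(sprintf("y%d.csv", comm.rank()))$y
f <- function(w) {                       # global negative log-likelihood
  z <- as.numeric(X %*% w)
  allreduce(sum(log(1 + exp(z)) - y * z), op = "sum")
}
g <- function(w) {                       # global gradient X^T (sigma(z) - y)
  p <- 1 / (1 + exp(-as.numeric(X %*% w)))
  allreduce(as.numeric(crossprod(X, p - y)), op = "sum")
}
fit <- optim(rep(0, ncol(X)), f, g, method = "L-BFGS-B")
finalize()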
49. Some Tips on Optimization
● Take care with the stopping criteria
– A relative threshold might be enough
● Save the coefficients during the iterations and print the values of f and g, using the <<- operator (see the sketch below)
– You can then stop the iteration at any time
– Monitor the convergence
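A minimal sketch of that trick on a toy objective (all names and values are illustrative):
best_w <- NULL
f <- function(w) {
  val <- sum((w - 1)^2)                  # toy objective
  best_w <<- w                           # save the current coefficients globally
  cat(sprintf("f = %.6f\n", val))        # monitor the convergence
  val
}
res <- optim(c(0, 0), f, method = "BFGS",
             control = list(reltol = 1e-6))  # relative stopping threshold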
51. The LinkedIn Way
Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013.
● Too much data to fit on a single machine
– Billions of observations, millions of features
● A naive approach
– Partition the data and run logistic regression on each partition
– Take the mean of the learned coefficients
– Problem: not guaranteed to converge to the single-machine model!
● Alternating Direction Method of Multipliers (ADMM)
– Boyd et al. 2011 (based on earlier work from the 70s)
52. ADMM
Each node k holds its own data and coefficients $w_k$; a consensus constraint ties them to the global $w$:

$\min \sum_{k=1}^{K} f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 \quad \text{subject to } w_k = w \;\; \forall k$

The updates are

$w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + \frac{\rho}{2}\|w_k - w^t + u_k^t\|_2^2$

$w^{t+1} = \arg\min_{w} \frac{\lambda}{2}\|w\|_2^2 + \frac{\rho}{2}\sum_{k=1}^{K}\|w_k^{t+1} - w + u_k^t\|_2^2$

$u_k^{t+1} = u_k^t + w_k^{t+1} - w^{t+1}$
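A minimal single-machine simulation of these updates for L2-regularized logistic regression (K, rho, lambda, and the data are illustrative; each local subproblem is solved with optim, and in production the k-loop would run on separate nodes):
set.seed(1)
K <- 4; n <- 5; rho <- 1; lambda <- 1
Xs <- lapply(1:K, function(k) matrix(rnorm(200 * n), 200, n))
ys <- lapply(Xs, function(X) rbinom(nrow(X), 1, 1 / (1 + exp(-X %*% rep(0.5, n)))))
nll <- function(w, X, y) sum(log(1 + exp(X %*% w)) - y * (X %*% w))
w <- rep(0, n); wk <- matrix(0, K, n); u <- matrix(0, K, n)
for (t in 1:50) {
  for (k in 1:K)                                    # local w_k updates
    wk[k, ] <- optim(wk[k, ], function(v)
      nll(v, Xs[[k]], ys[[k]]) + rho / 2 * sum((v - w + u[k, ])^2),
      method = "BFGS")$par
  w <- rho * colSums(wk + u) / (lambda + K * rho)   # closed-form global update
  u <- u + wk - matrix(w, K, n, byrow = TRUE)       # dual variable updates
}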
55. Our Remarks on ADMM
● ADMM saves communication between the nodes
● In our environment, the communication overhead is affordable
– So ADMM did not improve the performance of our system
56. Online Algorithm
Stochastic Gradient Descent (SGD):

$f(w \mid y_t, x_t) = -y_t \log(\sigma(w^T x_t)) - (1 - y_t)\log(1 - \sigma(w^T x_t))$
$w_{t+1} = w_t - \eta \nabla f(w_t \mid y_t, x_t)$

● Choose an initial value and a learning rate
● Randomly shuffle the instances in the training set
● Scan the data and update the coefficients one instance at a time
– Repeat until an approximate minimum is obtained (see the sketch below)
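A minimal sketch of one SGD pass in plain R on simulated data (all sizes and the learning rate are illustrative):
set.seed(1)
m <- 10000; n <- 10
X <- matrix(rnorm(m * n), m, n)
w_true <- rnorm(n)
y <- rbinom(m, 1, 1 / (1 + exp(-as.numeric(X %*% w_true))))
w <- rep(0, n); eta <- 0.1
for (t in sample(m)) {                   # one shuffled pass over the data
  p <- 1 / (1 + exp(-sum(X[t, ] * w)))   # predicted click probability
  w <- w - eta * (p - y[t]) * X[t, ]     # gradient step on one instance
}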
57. SGD to Follow The Proximal Regularized Leader
H. Brendan McMahan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. AISTATS 2011.

The SGD step can be rewritten as a minimization:

$w_{t+1} = w_t - \eta_t \nabla f(w_t \mid y_t, x_t) = \arg\min_w \nabla f(w_t \mid y_t, x_t)^T w + \frac{1}{2\eta_t}(w - w_t)^T (w - w_t)$

Let $g_t = \nabla f(w_t \mid y_t, x_t)$ and $g_{1:t} = \sum_{i=1}^{t} g_i$. FTPRL instead solves

$w_{t+1} = \arg\min_w g_{1:t}^T w + t\lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\sum_{i=1}^{t}\|w - w_i\|_2^2$
58. Regret of SGD
H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010.

$\text{Regret} := \sum_{t=1}^{T} f_t(w_t) - \min_w \sum_{t=1}^{T} f_t(w)$

A global learning rate achieves the regret bound $O(DM\sqrt{T})$, where $D$ is the $L_2$ diameter of the feasible set and $M$ is the $L_2$ bound on the gradients $g$.
59. Regret of SGD
H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010.

The per-coordinate learning rate

$\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^2}}$

achieves the regret bound $O(\sqrt{T}\, n^{1-\gamma/2})$, where $n$ is the dimension of $w$, assuming $w \in [-0.5, 0.5]^n$ (so $D = \sqrt{n}$) and $P(x_{t,i} = 1) \sim i^{-\gamma}$ for some $\gamma \in [1, 2)$. A plain-R sketch of this update follows.
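A minimal sketch of the per-coordinate rate on simulated data (sizes, alpha, and beta are illustrative):
set.seed(1)
m <- 5000; n <- 20
X <- matrix(rbinom(m * n, 1, 0.1), m, n)   # sparse binary features
y <- rbinom(m, 1, 0.5)
w <- rep(0, n); G <- rep(0, n)             # G accumulates squared gradients
alpha <- 0.1; beta <- 1
for (t in sample(m)) {
  p <- 1 / (1 + exp(-sum(X[t, ] * w)))
  g <- (p - y[t]) * X[t, ]                 # per-instance gradient
  G <- G + g^2
  w <- w - alpha / (beta + sqrt(G)) * g    # eta_{t,i} from the formula above
}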
60. Comparison of Learning Rate Schemes
Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.
61. Google KDD 2013: FTPRL
H. Brendan McMahan et al. Ad Click Prediction: a View from the Trenches. KDD 2013.
62. Some Remarks on FTPRL
● FTPRL is a general optimization framework
– We used it successfully to fit neural networks
● The per-coordinate learning rate greatly improves the convergence on our data
– SGD also works with a per-coordinate learning rate
● The "proximal" part decreases accuracy but introduces sparsity
63. Implementation of FTPRL in R
● I am not aware of any other implementation of online optimization in R
● The algorithm is simple: just write it as a for loop (see the sketch below)
● The overhead of a loop is small in C/C++ compared to R
● I implemented the algorithm in https://github.com/wush978/BridgewellML/tree/r-pkg
– Call for users
– Contact me if you want to try it
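A minimal sketch of that loop: per-coordinate FTPRL for logistic regression on simulated data, following the closed-form update in McMahan et al. 2013 (the sizes and the alpha, beta, l1, l2 values are illustrative):
set.seed(1)
m <- 5000; d <- 20
X <- matrix(rbinom(m * d, 1, 0.1), m, d)    # sparse binary features
y <- rbinom(m, 1, 0.3)
z <- rep(0, d); n2 <- rep(0, d)             # FTPRL state: z and sum of g^2
alpha <- 0.1; beta <- 1; l1 <- 1; l2 <- 1
for (t in sample(m)) {
  w <- ifelse(abs(z) <= l1, 0,              # closed-form sparse weights
              -(z - sign(z) * l1) / ((beta + sqrt(n2)) / alpha + l2))
  p <- 1 / (1 + exp(-sum(X[t, ] * w)))
  g <- (p - y[t]) * X[t, ]
  sigma <- (sqrt(n2 + g^2) - sqrt(n2)) / alpha
  z <- z + g - sigma * w                    # accumulate the adjusted gradients
  n2 <- n2 + g^2
}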
65. Batch vs. Online
Olivier Chapelle et al. Simple and Scalable Response Prediction for Display Advertising.
● Batch algorithms
– Optimize the likelihood function to a high accuracy once they are in a good neighborhood of the optimal solution
– Quite slow to reach that neighborhood
– Straightforward to generalize to a distributed environment
● Online algorithms (mini-batch)
– Optimize the likelihood to a rough precision quite fast
– Need only a handful of passes over the data
– Tricky to parallelize
66. Criteo Inc.: Hybrid of Online and Batch
● Each node makes one online pass over its local data using adaptive-gradient updates
● The local weights are then averaged and used as the initial value for L-BFGS
67. Facebook
Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.
● Decision Tree (Batch) for Feature Transforms
● Logistic Regression (Online)
73. Display Advertising Challenge
● https://www.kaggle.com/c/criteo-display-ad-challenge
● 7 × 10^7 instances
● 13 integer features and 26 categorical features with about 3 × 10^7 levels
● We placed 9th out of 718 teams
– We fit a neural network (2-layer logistic regression) to the data with FTPRL and dropout
74. Dropout in SGD
Geoffrey E. Hinton et al. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. CoRR 2012.
75. Tools for Large-Scale Model Fitting
● Almost all of the top-10 competitors implemented the algorithms themselves
– There is no dominant tool for large-scale model fitting
● The winner used only 20 GB of memory. See https://github.com/guestwalk/kaggle-2014-criteo
● For a single machine, there are some good machine learning libraries
– LIBLINEAR for linear models (a student from the lab behind it placed 1st)
– xgboost for gradient boosted regression trees (its author placed 12th)
– Vowpal Wabbit