# Online advertising and large scale model fitting

Introducing my work to my teachers and schoolmates at the Institute of Statistics, NTHU, Taiwan.


1. 1. Online Advertising and Large-Scale Model Fitting – Wush Wu, 2014-10-24
2. 2. Outline ● Introduction to Online Advertising ● Handling Real Data – Data Engineering – Model Matrix – Enhancing the Computation Speed of R ● Fitting Models to Large-Scale Data – Batch Algorithms – Parallelizing Existing Algorithms – Online Algorithms – SGD, FTPRL and Learning-Rate Schemas ● Display Advertising Challenge
7. 7. Online Advertising is Growing Rapidly
8. 8. Why is Online Advertising Growing? ● Wide reach ● Target oriented ● Quick conversion ● Highly informative ● Cost-effective ● Easy to use ● Measurable – "Half the money I spend on advertising is wasted; the trouble is I don't know which half."
9. 9. How do we measure online ads? ● User behavior on the internet is trackable. – We know who watches the ad. – We know who buys the product. ● We collect data for measurement.
10. 10. How do we collect the data?
11. 11. Performance-based advertising ● Pricing Model – Cost-Per-Mille (CPM) – Cost-Per-Click (CPC) – Cost-Per-Action (CPA) or Cost-Per-Order (CPO)
12. 12. To Improve Profit ● Display the ad with the highest Click-Through Rate (CTR) × CPC, or Conversion Rate (CVR) × CPO ● Estimating the probability of a click (conversion) is the central problem – Rule Based – Statistical Modeling (Machine Learning)
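The ranking rule on this slide can be sketched in R: expected revenue per impression is CTR × CPC (or CVR × CPO), and we display the ad that maximizes it. The ad IDs, rates, and prices below are made-up illustrative values, not data from the talk.

```r
# Rank ads by expected revenue per impression: CTR * CPC
# (illustrative, made-up values)
ads <- data.frame(
  id  = c("A", "B", "C"),
  ctr = c(0.010, 0.004, 0.020),   # estimated click-through rates
  cpc = c(1.0, 5.0, 0.3),         # cost per click the advertiser pays
  stringsAsFactors = FALSE
)
ads$expected_revenue <- ads$ctr * ads$cpc
best <- ads$id[which.max(ads$expected_revenue)]   # the ad we display
```

Note that the ad with the highest CTR is not necessarily the one displayed: here ad B wins despite the lowest CTR, because its CPC is high.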
13. 13. System ● Website → Ad Request → Recommendation → Ad Delivering → Website ● Log Server → Model Fitting (Batch / Online)
14. 14. Rule Based ● Let the advertiser select the target group
15. 15. Statistical Modeling ● We log the display and collect the response ● Features – Ad – Channel – User
16. 16. Features of Ad ● Ad type – Text – Figure – Video ● Ad Content – Fashion – Health – Game
17. 17. Features of Channel ● Visibility
18. 18. Features of User ● Sex ● Age ● Location ● Behavior
19. 19. Real Features Zhang, Weinan and Yuan, Shuai and Wang, Jun and Shen, Xuehua. Real-Time Bidding Benchmarking with iPinYou Dataset
20. 20. Know-How vs. Know-Why ● We usually do not study the reason for a high CTR ● A small improvement in accuracy implies a large improvement in profit ● Predictive Analysis
21. 21. Data ● School – Static – Cleaned – Public ● Commercial – Dynamic – Contains errors – Private
22. 22. Data Engineering ● Join the impression log with the click log (Impression + Click), e.g.:

| CLICK_TIME | CLIENT_IP | CLICKED_ADID |
| --- | --- | --- |
| 2014/05/17 ... | 2.17.x.x | 133594 |
| 2014/05/17 ... | 140.112.x.x | 134811 |
23. 23. Data Engineering with R http://wush978.github.io/REngineering/ ● Automation of R jobs – Convert an R script into a command-line application – Learn modern tools such as Jenkins ● Connections between multiple machines – Learn ssh ● Logging – Linux tools: bash redirection, tee – R package: logging ● R error handling – try, tryCatch
24. 24. Characteristics of Data ● Rare events ● Large amount of categorical features – Binning numerical features ● Features are highly correlated ● Some features occur frequently, some occur rarely
25. 25. Common Statistical Model for CTR ● Logistic Regression ● Gradient Boosted Regression Tree – Check xgboost
26. 26. Logistic Regression $P(\text{Click} \mid x) = \frac{1}{1+e^{-w^T x}} = \sigma(w^T x)$ ● Linear relationship with features – Fast prediction – (Relatively) fast fitting ● Usually fit the model with L2 regularization
27. 27. How large is the data? ● Instances: 10^9 ● Binary features: 10^5
28. 28. Subsampling ● Sampling is useful for: – Data exploration – Code testing ● Sampling might harm accuracy (profit) – Rare events – Some features occur frequently and some occur rarely ● We do not subsample the data so far
29. 29. Sampling ● Olivier Chapelle, et al. Simple and scalable response prediction for display advertising.
30. 30. Computation $P(\text{Click} \mid x) = \frac{1}{1+e^{-w^T x}}$
31. 31. Model Matrix

```r
head(model.matrix(Species ~ ., iris))
```
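The Matrix package also provides `sparse.model.matrix` (a standard function of that package, though not shown on the slide), which builds the same design matrix in sparse form:

```r
library(Matrix)  # ships with R as a recommended package

mm_dense  <- model.matrix(Species ~ ., iris)
mm_sparse <- sparse.model.matrix(Species ~ ., iris)
# same rows and columns, but only nonzero entries are stored
```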
32. 32. Dense Matrix ● 10^9 instances ● 10^5 binary features ● 10^14 elements in the model matrix ● Size: 4 × 10^14 bytes – 400 TB ● In-memory access is about 10^3 times faster than on-disk
33. 33. R and Large Scale Data ● R cannot handle large scale data ● R consumes lots of memory
34. 34. Sparse Matrix
35. 35. Sparse Matrix $A \in \mathbb{R}^{m \times n}$ with $k$ nonzero elements.
Dense matrix: $\begin{bmatrix}1&0&1&0\\0&0&0&0\\0&0&0&0\\0&0&1&0\end{bmatrix}$ requires $4mn$ bytes.
Triplet list: $(1,1,1), (1,3,1), (4,3,1)$ requires $12k$ bytes.
Compressed by row: $i:\{1,3,3\}$, $p:\{2,0,0,1\}$, $x:\{1,1,1\}$ requires $8k + 4m$ bytes.
Compressed by column: $j:\{1,1,4\}$, $p:\{1,0,2,0\}$, $x:\{1,1,1\}$ requires $8k + 4n$ bytes.
36. 36. Sparse Matrix ● The number of nonzeros can be estimated from the number of categorical variables: $m \sim 10^9$, $n \sim 10^5$, $k \sim 10^1 \times 10^9$. Dense matrix: $4 \times 10^{14}$ bytes. List: $12 \times 10^{10}$ bytes. Compressed: $8 \times 10^{10} + 4 \times 10^9$ or $8 \times 10^{10} + 4 \times 10^5$ bytes.
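The size formulas can be checked with quick arithmetic in R, assuming (as the slide does) roughly 10 nonzero features per instance:

```r
# storage estimates (bytes) for an m x n model matrix with k nonzeros,
# assuming 4-byte integers and 4-byte floats
m <- 1e9; n <- 1e5; k <- 10 * 1e9   # ~10 nonzero features per instance

dense      <- 4 * m * n             # 4mn: every element stored
triplet    <- 12 * k                # (row, col, value) per nonzero
compressed <- 8 * k + 4 * n         # column-compressed: index + value + p
```

The compressed form brings 400 TB down to under 100 GB, which is why sparse storage is the difference between "impossible" and "fits on a few machines" here.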
37. 37. Sparse Matrix ● Sparse matrix is useful for: – Large amount of categorical data – Text Analysis – Tag Analysis
38. 38. R package: Matrix

```r
m1 <- matrix(0, 5, 5); m1[1, 4] <- 1
m1
library(Matrix)
m2 <- Matrix(0, 5, 5, sparse = TRUE)
m2[1, 4] <- 1
m2
```
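The memory saving from the slide's comparison can be observed directly with `object.size`; the 1000 × 1000 size here is my choice to make the gap visible (exact byte counts vary by R version):

```r
library(Matrix)

m1 <- matrix(0, 1000, 1000); m1[1, 4] <- 1                # dense: ~8 MB
m2 <- Matrix(0, 1000, 1000, sparse = TRUE); m2[1, 4] <- 1 # stores 1 nonzero
size_dense  <- as.numeric(object.size(m1))
size_sparse <- as.numeric(object.size(m2))
```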
39. 39. Computation Speed

```r
m1 <- matrix(0, 5, 5); m1[1, 4] <- 1
library(Matrix)
m2 <- Matrix(0, 5, 5, sparse = TRUE)
m2[1, 4] <- 1
```
40. 40. Advanced tips: package Rcpp ● C/C++ uses memory more efficiently ● Rcpp provides an easy interface between R and C/C++

```cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
SEXP XTv(S4 m, NumericVector v, NumericVector& retval) {
  //...
}
```
41. 41. Two approaches to fitting logistic regression to large-scale data ● Batch algorithms – Optimize the log-likelihood globally ● Online algorithms – Optimize the loss function instance by instance
42. 42. Batch Algorithm Negative log-likelihood: $f(w \mid (x_1, y_1), \ldots, (x_m, y_m)) = \sum_{t=1}^{m} -y_t \log(\sigma(w^T x_t)) - (1 - y_t)\log(1 - \sigma(w^T x_t))$ Gradient descent: $w_{t+1} = w_t - \eta \nabla f(w_t)$ ● Each update requires scanning all the data
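A minimal R sketch of the batch update above: every gradient step touches the whole data set. The toy data, step size, and iteration count are illustrative assumptions, not settings from the talk.

```r
# Batch gradient descent for logistic regression:
# each update scans all instances
sigmoid <- function(z) 1 / (1 + exp(-z))

fit_batch <- function(X, y, eta = 0.1, iters = 200) {
  w <- numeric(ncol(X))
  for (it in seq_len(iters)) {
    p    <- sigmoid(as.vector(X %*% w))               # full pass over data
    grad <- as.vector(t(X) %*% (p - y)) / nrow(X)     # gradient of mean NLL
    w    <- w - eta * grad
  }
  w
}

set.seed(1)
X <- cbind(1, matrix(rnorm(200), 100, 2))             # intercept + 2 features
y <- as.numeric(X %*% c(-0.5, 2, -1) + rnorm(100, sd = 0.1) > 0)
w <- fit_batch(X, y)
```

The per-iteration cost is one matrix–vector product with the full model matrix, which is exactly the operation the row-wise partition on the next slide parallelizes.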
43. 43. Parallelize an Existing Batch Algorithm ● Row-wise partition: $\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} v = \begin{pmatrix} X_1 v \\ X_2 v \end{pmatrix}$, $(v_1^T \; v_2^T) \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = v_1^T X_1 + v_2^T X_2$ ● We can split the data by instances across several machines ● The matrix–vector multiplication can be parallelized
44. 44. Frameworks for Parallelization ● Hadoop – Slow for iterative algorithms – Fault tolerant – Good for many machines ● MPI – If the data fit in memory, fast for iterative algorithms – No fault tolerance – Good for several machines
45. 45. R Package: pbdMPI
46. 46. R Package: pbdMPI ● Easy to install (on Ubuntu) – sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev – install.packages("pbdMPI") ● Easy to develop with (compared to Rmpi)
47. 47. R Package: pbdMPI

```r
library(pbdMPI)
.rank <- comm.rank()
filename <- sprintf("%d.csv", .rank)
data <- read.csv(filename)
target <- reduce(sum(data$value), op = "sum")
finalize()
```
48. 48. Parallelize the Algorithm with pbdMPI ● Implement the functions required by the optimizer with pbdMPI – optim requires f and g (the gradient of f) – nlminb requires f, g, and H (the Hessian of f) – tron requires f, g, and Hs (H multiplied by a given vector s)
49. 49. Some Tips for Optimization ● Take care with the stopping criteria – A relative threshold might be enough ● Save the coefficients during iteration and print the values of f and g with the operator <<- – You can stop the iteration at any time – Monitor the convergence
50. 50. Overview
51. 51. LinkedIn Way Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013 ● Too much data to fit on a single machine – Billions of observations, millions of features ● A naive approach – Partition the data and run logistic regression on each partition – Take the mean of the learned coefficients – Problem: not guaranteed to converge to the single-machine model! ● Alternating Direction Method of Multipliers (ADMM) – Boyd et al. 2011 (based on earlier work from the 70s)
52. 52. ADMM For each node $k$, the data and coefficients are different:
$\min \sum_{k=1}^{K} f_k(w_k) + \lambda_2 \|w\|_2^2 \quad \text{subject to } w_k = w \;\; \forall k$
$w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + \frac{\rho}{2}\|w_k - w^t + u_k^t\|_2^2$
$w^{t+1} = \arg\min_{w} \lambda_2 \|w\|_2^2 + \frac{\rho}{2}\sum_{k=1}^{K}\|w_k^{t+1} - w + u_k^t\|_2^2$
$u_k^{t+1} = u_k^t + w_k^{t+1} - w^{t+1}$
53. 53. Update Coefficients Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013 ● BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K → Logistic Regression on each partition → Consensus Computation
54. 54. Update Regularization Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013 ● BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K → Consensus Computation → Logistic Regression on each partition
55. 55. Our Remarks on ADMM ● ADMM saves communication between the nodes ● In our environment, the communication overhead is affordable – ADMM does not enhance the performance of our system
56. 56. Online Algorithm Stochastic gradient descent (SGD): $f(w \mid y_t, x_t) = -y_t \log(\sigma(w^T x_t)) - (1 - y_t)\log(1 - \sigma(w^T x_t))$, $w_{t+1} = w_t - \eta \nabla f(w_t \mid y_t, x_t)$ ● Choose an initial value and a learning rate ● Randomly shuffle the instances in the training set ● Scan the data and update the coefficients – Repeat until an approximate minimum is obtained
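The steps on this slide can be sketched in a few lines of R: shuffle once, then update the coefficients one instance at a time. The toy data and learning rate are illustrative assumptions.

```r
# SGD for logistic regression: one instance per update
sigmoid <- function(z) 1 / (1 + exp(-z))

sgd_step <- function(w, x, y, eta) {
  p <- sigmoid(sum(w * x))
  w - eta * (p - y) * x   # per-instance gradient of the NLL is (p - y) * x
}

set.seed(2)
X <- cbind(1, matrix(rnorm(400), 200, 2))    # intercept + 2 features
y <- as.numeric(X %*% c(0, 1.5, -1.5) > 0)
w <- numeric(3)
for (t in sample(nrow(X))) {                 # random shuffle, one pass
  w <- sgd_step(w, X[t, ], y[t], eta = 0.1)
}
```

In contrast to the batch update, each step here touches a single row, so one pass over the data already produces a usable model.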
57. 57. SGD to Follow The Proximal Regularized Leader H. Brendan McMahan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. AISTATS 2011
$w_{t+1} = w_t - \eta_t \nabla f(w_t \mid y_t, x_t) = \arg\min_w \nabla f(w_t \mid y_t, x_t)^T w + \frac{1}{2\eta_t}\|w - w_t\|_2^2$
Let $g_t = \nabla f(w_t \mid y_t, x_t)$ and $g_{1:t} = \sum_{i=1}^{t} g_i$. Then
$w_{t+1} = \arg\min_w g_{1:t}^T w + t\lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\sum_{i=1}^{t} \|w - w_i\|_2^2$
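A minimal R sketch of an FTRL-Proximal-style learner with per-coordinate learning rates and L1-induced sparsity, in the spirit of McMahan's closed-form update. The hyperparameters (alpha, beta, l1, l2) and the toy data are my assumptions, not values from the talk or from the BridgewellML package.

```r
# FTRL-Proximal sketch: per-coordinate learning rate + L1 sparsity
sigmoid <- function(z) 1 / (1 + exp(-z))

ftrl_fit <- function(X, y, alpha = 0.1, beta = 1, l1 = 1, l2 = 1) {
  d <- ncol(X)
  z <- numeric(d)   # accumulated (adjusted) gradients
  n <- numeric(d)   # accumulated squared gradients, per coordinate
  solve_w <- function(z, n) {
    # closed form: coefficient is exactly 0 when |z_i| <= l1
    ifelse(abs(z) <= l1, 0,
           -(z - sign(z) * l1) / ((beta + sqrt(n)) / alpha + l2))
  }
  for (t in seq_len(nrow(X))) {
    w <- solve_w(z, n)
    g <- (sigmoid(sum(w * X[t, ])) - y[t]) * X[t, ]  # per-instance gradient
    sigma <- (sqrt(n + g^2) - sqrt(n)) / alpha       # learning-rate change
    z <- z + g - sigma * w
    n <- n + g^2
  }
  solve_w(z, n)
}

set.seed(3)
X <- cbind(1, matrix(rnorm(600), 300, 2))  # intercept + 2 features
y <- as.numeric(X[, 2] > 0)                # label depends on feature 2 only
w <- ftrl_fit(X, y)
```

The per-coordinate denominator `(beta + sqrt(n)) / alpha` is exactly the inverse of the learning rate $\eta_{t,i}$ on the next slide, and the L1 threshold is what produces sparse coefficients without hurting frequently-seen features.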
58. 58. Regret of SGD H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010
$\text{Regret} := \sum_{t=1}^{T} f_t(w_t) - \min_w \sum_{t=1}^{T} f_t(w)$
A global learning rate achieves the regret bound $O(DM\sqrt{T})$, where $D$ is the $L_2$ diameter of the feasible set and $M$ is the $L_2$ bound of $g$.
59. 59. Regret of SGD H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010
Per-coordinate learning rate: $\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^2}}$
If $w \in [-0.5, 0.5]^n$ (so $D = \sqrt{n}$, with $n$ the dimension of $w$) and $P(x_{t,i} = 1) \sim i^{-\gamma}$ for some $\gamma \in [1, 2)$, this achieves the regret bound $O(\sqrt{T}\, n^{1 - \gamma/2})$.
60. 60. Comparison of Learning-Rate Schemas Xinran He, et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.
61. 61. Google KDD 2013, FTPRL H. Brendan McMahan, et al. Ad Click Prediction: a View from the Trenches. KDD 2013.
62. 62. Some Remarks on FTPRL ● FTPRL is a general optimization framework – We used it successfully to fit a neural network ● The per-coordinate learning rate greatly improves the convergence on our data – SGD works with a per-coordinate learning rate ● The "proximal" part decreases the accuracy, but introduces sparsity
63. 63. Implementation of FTPRL in R ● I am not aware of any other implementation of online optimization in R ● The algorithm is simple; just write it with a for loop ● The overhead of the loop is small in C/C++ compared to R ● I implemented the algorithm in https://github.com/wush978/BridgewellML/tree/r-pkg – Call for users – Contact me if you want to try it
64. 64. FTPRL vs. TRON
65. 65. Batch vs. Online Olivier Chapelle, et al. Simple and scalable response prediction for display advertising. ● Batch algorithms – Optimize the likelihood function to high accuracy once they are in a good neighborhood of the optimal solution – Quite slow in reaching that solution – Straightforward to generalize to a distributed environment ● Online algorithms (mini-batch) – Optimize the likelihood to a rough precision quite fast – A handful of passes over the data – Tricky to parallelize
66. 66. Criteo Inc. Hybrid of Online and Batch ● Each node makes one online pass over its local data with adaptive gradient updates ● Average the local weights to form the initial value for L-BFGS
67. 67. Facebook Xinran He, et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014. ● Decision Tree (Batch) for Feature Transforms ● Logistic Regression (Online)
68. 68. Data Size and Accuracy
69. 69. Experiment Designs
70. 70. Experiments Result
71. 71. Experiment Analysis
72. 72. Improving ● New Models ● New Algorithms ● New Features ● Experiments ● Analysis
73. 73. Display Advertising Challenge ● https://www.kaggle.com/c/criteo-display-ad-challenge ● 7 × 10^7 instances ● 13 integer features and 26 categorical features with about 3 × 10^7 levels ● We were 9th out of 718 teams – We fit a neural network (2-layer logistic regression) to the data with FTPRL and dropout
74. 74. Dropout in SGD Geoffrey E. Hinton, et al. Improving neural networks by preventing co-adaptation of feature detectors. CoRR 2012
75. 75. Tools for Large-Scale Model Fitting ● Almost all top-10 competitors implemented the algorithms themselves – There is no dominant tool for large-scale model fitting ● The winner used only 20 GB of memory. See https://github.com/guestwalk/kaggle-2014-criteo ● For a single machine, there are some good machine-learning libraries – LIBLINEAR for linear models (the student in the lab is no. 1) – xgboost for gradient boosted regression trees (the author is no. 12) – Vowpal Wabbit
76. 76. Thanks for listening