Online advertising and large scale model fitting

Introducing my work to my teachers and schoolmates at the Statistical Institute of NTHU, Taiwan.

  1. 1. Online Advertising and Large-scale Model Fitting. Wush Wu, 2014-10-24
  2. 2. Outline ● Introduction to Online Advertising ● Handling Real Data – Data Engineering – Model Matrix – Enhancing the Computation Speed of R ● Fitting Models to Large-scale Data – Batch Algorithms – Parallelizing Existing Algorithms – Online Algorithms – SGD, FTPRL, and Learning Rate Schemes ● Display Advertising Challenge
  3. 3. Ad Formats – Pre-Roll Video Ads
  4. 4. Ad Formats – Banner / Display Ads
  5. 5. AdWords Search Ads
  6. 6. Related Content Ads
  7. 7. Online Advertising is Growing Rapidly
  8. 8. Why is Online Advertising Growing? ● Wide reach ● Target oriented ● Quick conversion ● Highly informative ● Cost-effective ● Easy to use ● Measurable: “Half the money I spend on advertising is wasted; the trouble is I don't know which half.” (John Wanamaker)
  9. 9. How do we measure online ads? ● User behavior on the internet is trackable – We know who watches the ad – We know who buys the product ● We collect data for measurement.
  10. 10. How do we collect the data?
  11. 11. Performance-based advertising ● Pricing Model – Cost-Per-Mille (CPM) – Cost-Per-Click (CPC) – Cost-Per-Action (CPA) or Cost-Per-Order (CPO)
  12. 12. To Improve Profit ● Display the ad with the highest Click-Through Rate (CTR) * CPC, or Conversion Rate (CVR) * CPO ● Estimating the probability of click (conversion) is the central problem – Rule Based – Statistical Modeling (Machine Learning)
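A quick illustrative calculation (the numbers are hypothetical, not from the deck): an ad with CTR 1% and CPC $0.50 has an expected revenue per impression of 0.01 × 0.50 = $0.005, while an ad with CTR 2% but CPC $0.20 yields 0.02 × 0.20 = $0.004, so ranking by CTR alone would pick the less profitable ad.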
  13. 13. System: the Website sends an Ad Request to the Recommendation system, which delivers the Ad back to the Website; a Log Server records the traffic and feeds Model Fitting (Batch and Online).
  14. 14. Rule Based ● Let the advertiser select the target group
  15. 15. Statistical Modeling ● We log the displays and collect the responses ● Features – Ad – Channel – User
  16. 16. Features of Ad ● Ad type – Text – Figure – Video ● Ad Content – Fashion – Health – Game
  17. 17. Features of Channel ● Visibility
  18. 18. Features of User ● Sex ● Age ● Location ● Behavior
  19. 19. Real Features. Weinan Zhang, Shuai Yuan, Jun Wang, and Xuehua Shen. Real-Time Bidding Benchmarking with iPinYou Dataset.
  20. 20. Know-How vs. Know-Why ● We usually do not study the reasons behind a high CTR ● A small improvement in accuracy implies a large improvement in profit ● Predictive Analysis
  21. 21. Data ● Academic – Static – Cleaned – Public ● Commercial – Dynamic – Error-prone – Private
  22. 22. Data Engineering: join the impression log with the click log (Impression + Click):
  CLICK_TIME      CLIENT_IP     CLICKED  ADID
  2014/05/17 ...  2.17.x.x               133594
  2014/05/17 ...  140.112.x.x            134811
  23. 23. Data Engineering with R http://wush978.github.io/REngineering/ ● Automation of R jobs – Convert R scripts to command line applications – Learn modern tools such as jenkins ● Connections between multiple machines – Learn ssh ● Logging – Linux tools: bash redirection, tee – R package: logging ● R error handling – try, tryCatch (see the sketch below)
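A minimal sketch of the error-handling and logging pattern above, assuming a `filename` variable supplied by the surrounding job script:
  library(logging)
  basicConfig()                      # log to the console by default
  data <- tryCatch(
    read.csv(filename),              # any step of the batch job that may fail
    error = function(e) {
      logerror("read.csv failed: %s", conditionMessage(e))
      NULL                           # let the caller decide whether to abort
    }
  )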
  24. 24. Characteristics of the Data ● Rare events ● Large number of categorical features – Binning numerical features (see the sketch below) ● Features are highly correlated ● Some features occur frequently, some occur rarely
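A minimal sketch of binning a numerical feature into a categorical one with cut() (the breakpoints are illustrative):
  age <- c(17, 23, 31, 44, 58)
  age_bin <- cut(age, breaks = c(0, 18, 25, 35, 50, Inf),
                 labels = c("<18", "18-24", "25-34", "35-49", "50+"))
  table(age_bin)   # each level becomes one binary feature in the model matrix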
  25. 25. Common Statistical Models for CTR ● Logistic Regression ● Gradient Boosted Regression Trees – Check xgboost
  26. 26. Logistic Regression: P(Click | x) = 1 / (1 + e^{−w^T x}) = σ(w^T x) ● Linear relationship with the features – Fast prediction – (Relatively) fast fitting ● Usually fit with L2 regularization (see the sketch below)
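A minimal sketch of an L2-regularized logistic regression in R with the glmnet package (simulated data; the lambda value is an arbitrary illustration, not the author's setting):
  library(glmnet)
  x <- matrix(rbinom(1000 * 20, 1, 0.1), 1000, 20)    # sparse binary features
  y <- rbinom(1000, 1, 0.05)                          # rare positive responses
  fit <- glmnet(x, y, family = "binomial",
                alpha = 0, lambda = 0.01)             # alpha = 0 gives the L2 (ridge) penalty
  predict(fit, x[1:5, ], type = "response")           # estimated click probabilities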
  27. 27. How large is the data? ● Instances: 10^9 ● Binary features: 10^5
  28. 28. Subsampling ● Sampling is useful for: – Data exploration – Code testing ● Sampling might harm accuracy (profit) – Rare events – Some features occur frequently and some occur rarely ● So far we do not subsample the data
  29. 29. Sampling ● Olivier Chapelle et al. Simple and scalable response prediction for display advertising.
  30. 30. Computation: P(Click | x) = 1 / (1 + e^{−w^T x}); the core computation is the inner product w^T x.
  31. 31. Model Matrix
  head(model.matrix(Species ~ ., iris))
  32. 32. Dense Matrix ● 10^9 instances ● 10^5 binary features ● 10^14 elements in the model matrix ● Size: 4 × 10^14 bytes – 400 TB ● In-memory access is about 10^3 times faster than on-disk access
  33. 33. R and Large-scale Data ● Base R cannot handle data at this scale ● R consumes a lot of memory
  34. 34. Sparse Matrix
  35. 35. Sparse Matrix: A ∈ R^{m×n} with k nonzero elements, assuming 4-byte values and indices. Example (4×4, nonzeros at (1,1), (1,3), (4,3)):
  Dense matrix [1 0 1 0; 0 0 0 0; 0 0 0 0; 0 0 1 0] requires 4mn bytes.
  Triplet list (1,1,1), (1,3,1), (4,3,1) requires 12k bytes.
  Compressed by row: column indices {1, 3, 3}, row counts p: {2, 0, 0, 1}, values x: {1, 1, 1}; requires 8k + 4m bytes.
  Compressed by column: row indices {1, 1, 4}, column counts p: {1, 0, 2, 0}, values x: {1, 1, 1}; requires 8k + 4n bytes.
  36. 36. Sparse Matrix ● The number of nonzeros can be estimated from the number of categorical variables: m ~ 10^9, n ~ 10^5, k ~ 10 × 10^9 = 10^10 (roughly ten active features per instance). Dense matrix: 4 × 10^14 bytes. Triplet list: 12k ≈ 1.2 × 10^11 bytes. Compressed: 8k + 4m ≈ 8.4 × 10^10 bytes, or 8k + 4n ≈ 8 × 10^10 bytes.
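A minimal sketch of the size difference on a smaller example (10^4 × 10^3 with one nonzero per row; note that R stores doubles, 8 bytes, where the slide assumes 4-byte values):
  library(Matrix)
  m <- 1e4; n <- 1e3
  dense <- matrix(0, m, n)
  dense[cbind(1:m, sample(n, m, replace = TRUE))] <- 1   # one active feature per row
  sparse <- Matrix(dense, sparse = TRUE)                 # dgCMatrix, compressed by column
  object.size(dense)    # about 8 * m * n bytes
  object.size(sparse)   # about 12 * k bytes plus column pointers, k = m here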
  37. 37. Sparse Matrix ● Sparse matrices are useful for: – Large amounts of categorical data – Text analysis – Tag analysis
  38. 38. R package: Matrix
  m1 <- matrix(0, 5, 5); m1[1, 4] <- 1
  m1
  library(Matrix)
  m2 <- Matrix(0, 5, 5, sparse=TRUE)
  m2[1, 4] <- 1
  m2
  39. 39. Computation Speed
  m1 <- matrix(0, 5, 5); m1[1, 4] <- 1
  library(Matrix)
  m2 <- Matrix(0, 5, 5, sparse=TRUE)
  m2[1, 4] <- 1
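A minimal sketch of the speed comparison on a larger matrix (timings are machine-dependent): the dense product touches all n^2 entries, the sparse product only the stored nonzeros.
  library(Matrix)
  n <- 4000
  m1 <- matrix(0, n, n); m1[cbind(1:n, sample(n))] <- 1   # n nonzeros out of n^2
  m2 <- Matrix(m1, sparse = TRUE)
  v <- rnorm(n)
  system.time(for (i in 1:50) m1 %*% v)   # dense: O(n^2) work per product
  system.time(for (i in 1:50) m2 %*% v)   # sparse: O(k) work per product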
  40. 40. Advanced tips: package Rcpp ● C/C++ uses memory more efficiently ● Rcpp provides an easy interface between R and C/C++
  #include <Rcpp.h>
  using namespace Rcpp;
  // [[Rcpp::export]]
  SEXP XTv(S4 m, NumericVector v, NumericVector& retval) {
    //...
  }
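The body of XTv is elided on the slide. Below is a hypothetical sketch of what such a function could look like for a dgCMatrix (compressed-by-column) argument; it is not the author's implementation, and it returns the result rather than filling retval:
  #include <Rcpp.h>
  using namespace Rcpp;

  // [[Rcpp::export]]
  NumericVector XTv_sketch(S4 m, NumericVector v) {
    IntegerVector i = m.slot("i"), p = m.slot("p");  // row indices, column pointers
    NumericVector x = m.slot("x");                   // nonzero values
    IntegerVector dim = m.slot("Dim");
    NumericVector retval(dim[1]);                    // one entry per column of m
    for (int col = 0; col < dim[1]; ++col)
      for (int idx = p[col]; idx < p[col + 1]; ++idx)
        retval[col] += x[idx] * v[i[idx]];           // (m^T v)[col] += m[row, col] * v[row]
    return retval;
  }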
  41. 41. Two approaches to fitting logistic regression to large-scale data ● Batch algorithms – Optimize the log-likelihood globally ● Online algorithms – Optimize the loss function instance by instance
  42. 42. Batch Algorithm. Negative log-likelihood:
  f(w | (x_1, y_1), ..., (x_m, y_m)) = Σ_{t=1}^m [ −y_t log(σ(w^T x_t)) − (1 − y_t) log(1 − σ(w^T x_t)) ]
  Gradient descent: w_{t+1} = w_t − η ∇f(w_t). Each update requires scanning all the data.
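A minimal sketch of this batch gradient descent in R (dense X for brevity; the step size is an arbitrary choice):
  sigma <- function(z) 1 / (1 + exp(-z))
  gd_logistic <- function(X, y, eta = 0.1, iters = 100) {
    w <- numeric(ncol(X))
    for (t in 1:iters) {
      p <- sigma(drop(X %*% w))
      grad <- drop(crossprod(X, p - y))   # gradient of the negative log-likelihood
      w <- w - eta * grad / nrow(X)       # one update = one full scan of the data
    }
    w
  }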
  43. 43. Parallelizing an Existing Batch Algorithm: Row-wise Partition
  Stacking X = (X_1; X_2) row-wise: (X_1; X_2) v = (X_1 v; X_2 v) and (v_1^T v_2^T)(X_1; X_2) = v_1^T X_1 + v_2^T X_2
  ● We can split the data by instances across several machines ● The matrix-vector multiplications can be parallelized
  44. 44. Frameworks for Parallelization ● Hadoop – Slow for iterative algorithms – Fault tolerant – Good for many machines ● MPI – Fast for iterative algorithms if the data fit in memory – No fault tolerance – Good for several machines
  45. 45. R Package: pbdMPI
  46. 46. R Package: pbdMPI ● Easy to install (on Ubuntu) – sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev – install.packages("pbdMPI") ● Easy to develop with (compared to Rmpi)
  47. 47. R Package: pbdMPI
  library(pbdMPI)
  init()  # initialize the MPI communicator
  .rank <- comm.rank()
  filename <- sprintf("%d.csv", .rank)
  data <- read.csv(filename)
  target <- reduce(sum(data$value), op="sum")
  finalize()
  48. 48. Parallelizing Algorithms with pbdMPI ● Implement the functions required for optimization with pbdMPI – optim requires f and g (the gradient of f) – nlminb requires f, g, and H (the Hessian of f) – tron requires f, g, and Hs (H multiplied by a given vector s)
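A minimal sketch of supplying f and g to optim for logistic regression (X and y are assumed to be in scope; under pbdMPI, f and g would instead combine per-node partial sums with allreduce):
  sigma <- function(z) 1 / (1 + exp(-z))
  f <- function(w) {
    p <- sigma(drop(X %*% w))
    -sum(y * log(p) + (1 - y) * log(1 - p))   # negative log-likelihood
  }
  g <- function(w) drop(crossprod(X, sigma(drop(X %*% w)) - y))
  fit <- optim(numeric(ncol(X)), fn = f, gr = g, method = "L-BFGS-B")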
  49. 49. Some Tips for Optimization ● Take care with the stopping criteria – A relative threshold might be enough ● Save the coefficients during the iterations and print the values of f and g, using the operator <<- (see the sketch below) – You can stop the iterations at any time – Monitor the convergence
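A minimal sketch of that monitoring trick: wrap the objective so each evaluation saves the current coefficients (to a variable that survives an interrupt) and prints the objective value:
  make_monitored <- function(f) {
    function(w) {
      last_w <<- w             # saved outside the optimizer via <<-
      value <- f(w)
      cat("f =", value, "\n")  # watch the convergence
      value
    }
  }
  last_w <- NULL
  # usage: optim(w0, make_monitored(f), gr = g, method = "L-BFGS-B")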
  50. 50. Overview
  51. 51. LinkedIn Way (Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013) ● Too much data to fit on a single machine – Billions of observations, millions of features ● A naive approach – Partition the data and run logistic regression on each partition – Take the mean of the learned coefficients – Problem: not guaranteed to converge to the single-machine model! ● Alternating Direction Method of Multipliers (ADMM) – Boyd et al. 2011 (based on earlier work from the 70s)
  52. 52. ADMM. Each node k has its own data and coefficients w_k:
  minimize Σ_{k=1}^K f_k(w_k) + (λ/2)‖w‖_2^2 subject to w_k = w ∀k
  w_{t+1}^k = argmin_{w_k} f_k(w_k) + (ρ/2)‖w_k − w_t + u_t^k‖_2^2
  w_{t+1} = argmin_w (λ/2)‖w‖_2^2 + (ρ/2) Σ_{k=1}^K ‖w_{t+1}^k − w + u_t^k‖_2^2
  u_{t+1}^k = u_t^k + w_{t+1}^k − w_{t+1}
  53. 53. Update Coefficient (Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013): BIG DATA is split into Partition 1, Partition 2, Partition 3, ..., Partition K; each partition runs a Logistic Regression, followed by a Consensus Computation.
  54. 54. Update Regularization (Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013): the Consensus Computation feeds back into the per-partition Logistic Regressions.
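A minimal single-process sketch of the ADMM updates above for logistic regression, assuming the K partitions are lists of matrices X[[k]] and responses y[[k]] (a real deployment would run the k-loop on separate nodes):
  admm_lr <- function(X, y, K, lambda = 1, rho = 1, iters = 50) {
    d <- ncol(X[[1]])
    wk <- matrix(0, d, K); uk <- matrix(0, d, K); w <- numeric(d)
    local_obj <- function(wk_, Xk, yk, w, u) {
      eta <- drop(Xk %*% wk_)
      sum(log1p(exp(-(2 * yk - 1) * eta))) +      # logistic loss, yk in {0, 1}
        rho / 2 * sum((wk_ - w + u)^2)            # proximal (consensus) term
    }
    for (t in 1:iters) {
      for (k in 1:K)                              # local coefficient updates
        wk[, k] <- optim(wk[, k], local_obj, Xk = X[[k]], yk = y[[k]],
                         w = w, u = uk[, k], method = "BFGS")$par
      w <- rho * rowSums(wk + uk) / (lambda + K * rho)  # closed-form consensus step
      uk <- uk + wk - w                           # dual updates
    }
    w
  }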
  55. 55. Our Remarks on ADMM ● ADMM saves communication between the nodes ● In our environment, the communication overhead is affordable – ADMM does not enhance the performance of our system
  56. 56. Online Algorithm: Stochastic Gradient Descent (SGD)
  f(w | y_t, x_t) = −y_t log(σ(w^T x_t)) − (1 − y_t) log(1 − σ(w^T x_t))
  w_{t+1} = w_t − η ∇f(w_t | y_t, x_t)
  ● Choose an initial value and a learning rate ● Randomly shuffle the instances in the training set ● Scan the data and update the coefficients – Repeat until an approximate minimum is obtained
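A minimal sketch of this SGD loop in R (one epoch, constant learning rate, dense rows for brevity):
  sgd_logistic <- function(X, y, eta = 0.05) {
    w <- numeric(ncol(X))
    for (t in sample(nrow(X))) {            # randomly shuffle the instances
      p <- 1 / (1 + exp(-sum(X[t, ] * w)))
      w <- w - eta * (p - y[t]) * X[t, ]    # gradient of f(w | y_t, x_t)
    }
    w
  }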
  57. 57. SGD to Follow The Proximally Regularized Leader (H. Brendan McMahan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. AISTATS 2011)
  w_{t+1} = w_t − η_t ∇f(w_t | y_t, x_t) = argmin_w ∇f(w_t | y_t, x_t)^T w + (1 / 2η_t) ‖w − w_t‖_2^2
  Let g_t = ∇f(w_t | y_t, x_t) and g_{1:t} = Σ_{i=1}^t g_i. FTPRL then updates
  w_{t+1} = argmin_w g_{1:t}^T w + t λ_1 ‖w‖_1 + (λ_2 / 2) Σ_{i=1}^t ‖w − w_i‖_2^2
  58. 58. Regret of SGD (H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010)
  Regret := Σ_{t=1}^T f_t(w_t) − min_w Σ_{t=1}^T f_t(w)
  A global learning rate achieves the regret bound O(D M √T), where D is the L2 diameter of the feasible set and M is the L2 bound of the gradients.
  59. 59. Regret of SGD (H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010). Per-coordinate learning rate:
  η_{t,i} = α / (β + √(Σ_{s=1}^t g_{s,i}^2))
  achieves the regret bound O(√T · n^{1 − γ/2}), where n is the dimension of w, assuming w ∈ [−0.5, 0.5]^n (so D = √n) and P(x_{t,i} = 1) ∼ i^{−γ} for some γ ∈ [1, 2).
  60. 60. Comparison of Learning Rate Schemes. Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.
  61. 61. Google KDD 2013, FTPRL. H. Brendan McMahan et al. Ad Click Prediction: a View from the Trenches. KDD 2013.
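A minimal R sketch of a per-coordinate FTPRL update along the lines of the pseudocode in that paper (alpha, beta, l1, l2 are tuning parameters; `state` holds the accumulators z and n):
  ftprl_update <- function(state, x, y, alpha = 0.1, beta = 1, l1 = 1, l2 = 1) {
    i  <- which(x != 0)                     # only active coordinates change
    zi <- state$z[i]; ni <- state$n[i]
    wi <- ifelse(abs(zi) <= l1, 0,          # the L1 part induces exact zeros
                 -(zi - sign(zi) * l1) / ((beta + sqrt(ni)) / alpha + l2))
    p  <- 1 / (1 + exp(-sum(wi * x[i])))    # predicted probability
    g  <- (p - y) * x[i]                    # per-coordinate gradients
    s  <- (sqrt(ni + g^2) - sqrt(ni)) / alpha
    state$z[i] <- zi + g - s * wi
    state$n[i] <- ni + g^2
    state
  }
  # state starts as list(z = numeric(d), n = numeric(d)); call once per instance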
  62. 62. Some Remarks on FTPRL ● FTPRL is a general optimization framework – We used it successfully to fit a neural network ● The per-coordinate learning rate greatly improves convergence on our data – SGD also works with a per-coordinate learning rate ● The “proximal” part decreases accuracy but introduces sparsity
  63. 63. Implementation of FTPRL in R ● I am not aware of any existing implementation of online optimization in R ● The algorithm is simple: just write it with a for loop ● The loop overhead is small in C/C++ compared to R ● I implemented the algorithm in https://github.com/wush978/BridgewellML/tree/r-pkg – Call for users – Contact me if you want to try it
  64. 64. FTPRL vs. TRON
  65. 65. Batch vs. Online (Olivier Chapelle et al. Simple and scalable response prediction for display advertising.) ● Batch algorithms – Optimize the likelihood function to high accuracy once they are in a good neighborhood of the optimal solution – Quite slow in reaching that solution – Straightforward to generalize to a distributed environment ● Online algorithms (mini-batch) – Optimize the likelihood to a rough precision quite fast – A handful of passes over the data – Tricky to parallelize
  66. 66. Criteo Inc. Hybrid of Online and Batch ● Each node makes one online pass over its local data using adaptive gradient updates ● Average these local weights and use them as the initial value for L-BFGS
  67. 67. Facebook (Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.) ● Decision Tree (Batch) for Feature Transforms ● Logistic Regression (Online)
  68. 68. Data Size and Accuracy
  69. 69. Experiment Designs
  70. 70. Experiment Results
  71. 71. Experiment Analysis
  72. 72. Improving: New Models / New Algorithms / New Features → Experiments → Analysis → (repeat)
  73. 73. Display Advertising Challenge ● https://www.kaggle.com/c/criteo-display-ad-challenge ● 7 × 10^7 instances ● 13 integer features and 26 categorical features with about 3 × 10^7 levels ● We placed 9th out of 718 teams – We fit a neural network (2-layer logistic regression) to the data with FTPRL and dropout
  74. 74. Dropout in SGD. Geoffrey E. Hinton et al. Improving neural networks by preventing co-adaptation of feature detectors. CoRR 2012
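A minimal sketch of (inverted) dropout applied to sparse binary input features inside one SGD step, with an assumed dropout rate (the competition settings are not given here; idx and w are hypothetical names):
  rate <- 0.2
  kept <- idx[runif(length(idx)) > rate]   # idx: indices of the active features
  eta  <- sum(w[kept]) / (1 - rate)        # rescale so the expected sum is unchanged
  p    <- 1 / (1 + exp(-eta))              # predict on the thinned input
  # at test time, use all active features without scaling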
  75. 75. Tools for Large-scale Model Fitting ● Almost all top-10 competitors implemented the algorithms themselves – There is no dominant tool for large-scale model fitting ● The winner used only 20 GB of memory. See https://github.com/guestwalk/kaggle-2014-criteo ● For a single machine, there are some good machine-learning libraries – LIBLINEAR for linear models (the student from this lab placed 1st) – xgboost for gradient boosted regression trees (its author placed 12th) – Vowpal Wabbit
  76. 76. Thanks for listening