
Modern classification techniques

Slides to support Austin Machine Learning Meetup, 1/19/2015.
Overview of the techniques used in recent Kaggle code to perform online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and the hash trick.


  1. Modern Classification Techniques
     Mark Landry
     Austin Machine Learning Meetup
     1/19/2015
  2. Overview
     • Problem & Data
       – Click-through rate prediction for online auctions
       – 40 million rows
       – Sparse: gather characteristics
       – Down-sampled
     • Methods
       – Logistic regression
       – Sparse feature handling
       – Hash trick
       – Online learning
       – Online gradient descent
       – Adaptive learning rate
       – Regularization (L1 & L2)
     • Solution characteristics
       – Fast: 20 minutes
       – Efficient: ~4GB RAM
       – Robust: easy to extend
       – Accurate: competitive with factorization machines, particularly when extended to key interactions
  3. Two Data Sets
     • Primary use case: click logs
       – 40 million rows
       – 20 columns
       – Values appear in dense fashion, but it is a sparse feature space
         • For highly informative feature types (URL/site), 70% of features have 3 or fewer instances
       – Note: negatives have been down-sampled
     • Extended to a separate use case: clinical + genomic
       – 4k rows
       – 1,300 columns
       – Mix of dense and sparse features
  4. Methods and Objectives
     • Logistic regression: accuracy / base algorithm
     • Stochastic gradient descent: optimization
     • Adaptive learning rate: accuracy, speed
     • Regularization (L1 & L2): generalized solution
     • Online learning: speed
     • Sparse feature handling: memory efficiency
     • Hash trick: memory efficiency, robustness
  5. Implementation Infrastructure
     • From scratch: no machine learning libraries
     • Maintain vectors for:
       – Features (1/0)
       – Weights
       – Feature counts
     • Each vector uses the same index scheme (see the sketch below)
     • The hash trick means we can immediately find the index of any feature, and we bound the vector size (more later)
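     A minimal sketch (Python, in the style of the Kaggle code referenced on the Code References slide; D and the variable names are illustrative) of the parallel-vector layout, where one hashed index is shared across all arrays:

        D = 2 ** 20        # bounded feature space (see the hash trick slide)
        w = [0.] * D       # weights, one per hashed feature index
        n = [0.] * D       # counts: how many times each feature index has been seen
        # the 1/0 feature vector itself is never stored; each row is represented
        # only by the list of hashed indices whose value is 1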
  6. Logistic Regression
     • Natural fit for probability problems (0/1)
       – p = 1 / (1 + exp(-sum(weight * feature)))
       – Solves based on log odds
       – Better calibrated than many other algorithms (particularly decision trees), which is useful for the real-time bid problem
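     For illustration, a minimal sketch of the prediction step; clamping the exponent (the +/-35 bound here is an illustrative choice, not prescribed by the slides) keeps exp() from overflowing:

        from math import exp

        def predict(wTx):
            # logistic (sigmoid) function: maps the weighted sum to a probability in (0, 1)
            wTx = max(min(wTx, 35.), -35.)   # clamp for numerical safety
            return 1. / (1. + exp(-wTx))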
  7. Sparse Features
     • Every value receives its own column, indicating its absence/presence (0/1)
     • So 1 / (1 + exp(-sum(weight * feature))) resolves to 1 / (1 + exp(-sum(weight))) over only the features present in each instance (sketch below)
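     A sketch of the sparse version: an instance is just the list of hashed indices of the features that are present, so the weighted sum reduces to a sum over those weights (function and variable names are illustrative):

        from math import exp

        def predict_sparse(indices, w):
            # indices: hashed column indices of the features present in this instance
            # w: weight vector; absent features contribute nothing to the sum
            wTx = sum(w[i] for i in indices)
            wTx = max(min(wTx, 35.), -35.)   # clamp for numerical safety
            return 1. / (1. + exp(-wTx))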
  8. Hash Trick
     • The hash trick allows quick access into parallel arrays that hold the key information for your model
     • Example: use the native Python hash('string') to cast a string into a large integer
     • Bound the parameter space by using modulo
       – E.g. abs(hash('string')) % (2 ** 20)
       – The size of that space is a parameter, and you can set it as large as your system can handle
       – Why set it larger? Hash collisions
       – Keep features separate: abs(hash(feature-name + 'string')) % (2 ** 20)
     • Any hash function can have collisions. The particular function used here is fast, but much more likely to encounter a collision than MurmurHash or something more elaborate.
     • So a speed/accuracy tradeoff dictates which function to use. The more bits, the fewer hash collisions. (See the sketch below.)
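     A minimal sketch of the scheme above, using Python's built-in hash(); prepending the feature name keeps identical values from different columns on different indices (the separator and D are illustrative choices):

        D = 2 ** 20   # size of the bounded parameter space

        def hashed_index(feature_name, value, D=D):
            # map an arbitrary (column, value) pair to an index in [0, D)
            # note: Python 3 salts str hashes per process; set PYTHONHASHSEED
            # for indices that are stable across runs
            return abs(hash(feature_name + '_' + str(value))) % D

        i = hashed_index('site', 'example.com')   # example usage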
  9. Online Learning
     • Learn one record at a time (see the loop sketch below)
       – A prediction is always available at any point, and it is the best possible given the data the algorithm has seen
       – You do not have to retrain to take in more data
         • Though you may still want to
     • Depending on the learning rate used, you may want to iterate through the data set more than once
     • Fast: VW approaches the speed of the network interface
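     A hedged sketch of the online loop: predict first, score that prediction, then update, so the running log loss is always measured on records the model has not yet learned from. Here stream_rows() is a hypothetical reader yielding a label and the row's hashed indices, and update() is a hypothetical weight update (detailed on the following slides):

        from math import exp, log

        D = 2 ** 20
        w = [0.] * D
        loss, count = 0., 0

        for y, indices in stream_rows('train.csv'):          # hypothetical data reader
            wTx = max(min(sum(w[i] for i in indices), 35.), -35.)
            p = 1. / (1. + exp(-wTx))

            # progressive validation: log loss of the prediction made before updating
            loss += -log(p) if y == 1 else -log(1. - p)
            count += 1

            update(w, indices, p, y)                         # hypothetical update, slides 10 and 12

        print('average log loss:', loss / count)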
  10. OGD/SGD: Online Gradient Descent
     • Optimization algorithms are required to minimize the loss in logistic regression
     • Gradient descent, and its many variants, are a popular choice, especially with large-scale data
     • Visualization (in R):

       library(animation)

       par(mar = c(4, 4, 2, 0.1))
       grad.desc()

       ani.options(nmax = 50)
       par(mar = c(4, 4, 2, 0.1))
       f2 = function(x, y) sin(1/2 * x^2 - 1/4 * y^2 + 3) * cos(2 * x + 1 - exp(y))
       grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.3, tol = 1e-04)

       ani.options(nmax = 70)
       par(mar = c(4, 4, 2, 0.1))
       f2 = function(x, y) sin(1/2 * x^2 - 1/4 * y^2 + 3) * cos(2 * x + 1 - exp(y))
       grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.1, tol = 1e-04)

       # interesting comparison: https://imgur.com/a/Hqolp
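     For the logistic log loss, the gradient with respect to weight i is (p - y) * x_i, which for 0/1 features is simply (p - y) on the active indices. A single plain (non-adaptive) descent step then looks like the sketch below; alpha is an illustrative fixed step size:

        def sgd_step(w, indices, p, y, alpha=.1):
            # one online gradient descent step on the logistic log loss;
            # for 0/1 features the per-weight gradient is just (p - y)
            for i in indices:
                w[i] -= alpha * (p - y)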
  11. Other Common Optimization Algorithms
     • ADAGRAD – still slightly sensitive to the choice of learning rate (η)
     • ADADELTA
     • Newton's Method
     • Quasi-Newton
     • Momentum
  12. Adaptive Learning Rate
     • A difficulty of using SGD is finding a good learning rate
     • An adaptive learning rate adjusts the step size as training progresses
       – ADAGRAD is one such adaptive method
     • Simple learning rate used in the example code (sketch below)
       – alpha / (sqrt(n) + 1)
         • where n is the number of times a specific feature has been encountered
       – w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
     • The full weight update shrinks each change by the learning rate of that specific feature
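     A sketch of the update with the per-feature adaptive rate described above; n[i] counts how many times feature i has been seen, so frequently seen features take smaller steps while rare features move quickly (function name is illustrative):

        from math import sqrt

        def adaptive_step(w, n, indices, p, y, alpha=.1):
            for i in indices:
                # per-feature learning rate: alpha / (sqrt(n[i]) + 1)
                w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
                n[i] += 1.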
  13. Regularization (L1 & L2)
     • Regularization attempts to ensure robustness of a solution
     • Enforces a penalty term on the coefficients of a model, guiding toward a simpler solution
     • L1: guides parameter values to be 0
     • L2: guides parameters to be close to 0, but not 0
     • In practice, these ensure large coefficients are not applied to rare features (a rough sketch follows)
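     As a rough illustration only, L1/L2 penalties can be folded into a plain SGD step by adding their (sub)gradients to the loss gradient. Note this is not the FTRL-proximal update used in the referenced code (which handles L1 more carefully and produces exact zeros); L1 and L2 values here are arbitrary placeholders:

        def penalized_step(w, indices, p, y, alpha=.1, L1=1., L2=1.):
            for i in indices:
                grad = (p - y)                  # logistic loss gradient for a 0/1 feature
                grad += L2 * w[i]               # L2 penalty gradient: pulls weights toward 0
                grad += L1 * (1. if w[i] > 0 else -1. if w[i] < 0 else 0.)   # L1 subgradient
                w[i] -= alpha * grad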
  14. Related Tools
     • Vowpal Wabbit
       – Implements all of these features, plus far more
       – Command-line tool
       – SVMlight-like data format
       – Source code available on GitHub with a fairly open license
     • Straight Python implementation (see the Code References slide)
     • glmnet, for R: L1/L2 regression, sparse
     • scikit-learn, Python ML library: ridge, elastic net (L1+L2), SGD (can specify logistic regression)
     • H2O, Java tool; many of these techniques are used, particularly in deep learning
     • Many of these techniques are also used in neural networks, particularly deep learning
  15. Code References
     • Introductory version: online logistic regression, hash trick, adaptive learning rate
       – Kaggle forum post
         • The data set is available on that competition's data page
         • But you can easily adapt the code to your own data set by changing the train and test file names (lines 25-26) and the names of the id and output columns (lines 104-107, 129-130)
       – Direct link to the Python code from the forum post
       – GitHub version of the same Python code
     • Latest version: adds FTRL-proximal (including SGD, L1/L2 regularization), epochs, and automatic interaction handling
       – Kaggle forum post
       – Direct link to the Python code from the forum post (version 3)
       – GitHub version of the same Python code
  16. Additional References
     • Overall process
       – Google paper: FTRL-proximal and practical observations
       – Facebook paper: includes logistic regression and trees, feature handling, down-sampling
     • Follow The Regularized Leader Proximal (Google)
     • Optimization
       – Stochastic gradient descent: examples and guidance (Microsoft)
       – ADADELTA and discussion of additional optimization algorithms (Google/NYU intern)
       – Comparison visualization
     • Hash trick
       – The Wikipedia page offers a decent introduction
       – General description and list of references, from the VW author
