Regularization
Yow-Bang (Darren) Wang
8/1/2013
Outline
● VC dimension & VC bound – Frequentist viewpoint
● L1 regularization – An intuitive interpretation
● Model parameter prior – Bayesian viewpoint
● Early stopping – Also a regularization
● Conclusion
VC dimension & VC bound
– Frequentist viewpoint
Regularization
● (My) definition: Techniques to prevent overfitting
● Frequentists’ viewpoint:
○ Regularization = suppress model complexity
○ “Usually” done by inserting a term representing model complexity into the objective function:
○ (Equation, annotated on the slide: training error plus model complexity, weighted by a trade-off weight; sketched below.)
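A sketch of the objective that slide describes, using generic symbols (the names E_train, Ω, and λ are illustrative, not the slide's own notation):

```latex
\min_{\mathbf{w}} \;
  \underbrace{E_{\mathrm{train}}(\mathbf{w})}_{\text{training error}}
  \;+\;
  \underbrace{\lambda}_{\text{trade-off weight}} \cdot
  \underbrace{\Omega(\mathbf{w})}_{\text{model complexity}}
```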
VC dimension & VC bound
● Why suppress model complexity?
○ A theoretical bound on the testing error, the Vapnik–Chervonenkis (VC) bound, states the following (sketched after this list):
● To reduce the testing error, we prefer:
○ Low training error ( Etrain ↓ )
○ Big data ( N ↑ )
○ Low model complexity ( dVC ↓ )
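One common statement of the bound, following Learning From Data (the deck's first reference); the exact constants vary by formulation:

```latex
% With probability at least 1 - \delta over the draw of the N training examples:
E_{\mathrm{test}} \;\le\; E_{\mathrm{train}}
  \;+\; \sqrt{\frac{8}{N}\, \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}},
  \qquad
  m_{\mathcal{H}}(2N) \;\le\; (2N)^{d_{VC}} + 1
```

The penalty term grows with dVC and shrinks with N, which is why the three preferences above all tighten the bound.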
VC dimension & VC bound
● dVC : the VC dimension
○ We say a hypothesis set H has dVC(H) = N iff N is the largest number of instances for which there exists a certain set of instances that can be binary-classified into any combination of class labels by H (i.e., shattered by H).
● Example: H = {straight lines in 2D space}
○ (Figures: a line splitting the points into Label=1 / Label=0 regions, one figure per labeling ……)
VC dimension & VC bound
● Example (cont.): H = {straight lines in 2D space}
○ N=2: all 4 labelings {0,0}, {0,1}, {1,0}, {1,1} can be realized
○ N=3: all 8 labelings {0,0,0}, {0,0,1}, ……, {1,1,1} can be realized (points in general position)
○ N=4: fails, e.g., in the XOR-like case where diagonally opposite points share a label, so dVC = 3 (a numerical check is sketched below)
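A small numerical check of these claims (a sketch; the chosen points and the LP-based separability test are illustrative, using scipy's `linprog`):

```python
import itertools

import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Is there a line w.x + b = 0 with sign(w.x_i + b) = labels[i] for all i?
    Feasibility LP: y_i * (w . x_i + b) >= 1 for every instance i."""
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # 0 means a feasible (w1, w2, b) was found

pts3 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]                 # 3 points, general position
pts4 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]     # 4 points (square corners)

for pts in (pts3, pts4):
    shattered = all(linearly_separable(pts, labs)
                    for labs in itertools.product([-1, 1], repeat=len(pts)))
    print(f"{len(pts)} points shattered by 2D lines: {shattered}")
# Expected: 3 -> True, 4 -> False, consistent with dVC = 3 for straight lines in 2D
```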
Regularization – Frequentist viewpoint
● In general, more model parameters
↔ higher VC dimension
↔ higher model complexity
↔ looser VC bound on the testing error
Regularization – Frequentist viewpoint
● ……Therefore, reduce model complexity
↔ reduce VC dimension
↔ reduce number of free parameters
↔ reduce the L-0 norm of the parameters
↔ sparsity of parameters!
Regularization – Frequentist viewpoint
● The L-p norm of a K-dimensional vector x:
1. L-2 norm: $\|x\|_2 = \sqrt{\sum_{k=1}^{K} x_k^2}$
2. L-1 norm: $\|x\|_1 = \sum_{k=1}^{K} |x_k|$
3. L-0 norm: defined as $\|x\|_0 = \#\{\, k : x_k \neq 0 \,\}$, the number of nonzero components (not a true norm)
4. L-∞ norm: $\|x\|_\infty = \max_k |x_k|$
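A quick numerical check of the four definitions (the vector x below is an arbitrary example):

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

l2   = np.linalg.norm(x, 2)        # sqrt(3^2 + 0^2 + 4^2) = 5.0
l1   = np.linalg.norm(x, 1)        # |3| + |0| + |-4|      = 7.0
l0   = np.count_nonzero(x)         # number of nonzero entries = 2
linf = np.linalg.norm(x, np.inf)   # max_k |x_k|           = 4.0

print(l2, l1, l0, linf)
```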
Regularization – Frequentist viewpoint
● However, since the L-0 norm is hard to incorporate into the objective function (∵ not continuous), we turn to the other, more tractable L-p norms
● E.g. linear SVM (in one standard form): $\min_{\mathbf{w},b} \;\; \frac{\lambda}{2}\|\mathbf{w}\|_2^2 \;+\; \sum_i \max\bigl(0,\; 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\bigr)$
○ The first term is the L-2 regularization (a.k.a. large margin), λ is the trade-off weight, and the second term is the hinge loss
● Linear SVM = hinge loss + L-2 regularization!
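A minimal stochastic sub-gradient sketch of that objective (the function name and the hyperparameter values lam, lr, epochs are illustrative, not from the slides):

```python
import numpy as np

def linear_svm_subgradient(X, y, lam=0.1, lr=0.01, epochs=200, seed=0):
    """Minimize lam/2 * ||w||_2^2 plus the hinge loss by stochastic sub-gradient
    descent, one example per step.  X: (n, d) array, y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            active = margin < 1              # hinge sub-gradient is nonzero only here
            gw = lam * w - (y[i] * X[i] if active else 0.0)
            gb = -y[i] if active else 0.0
            w -= lr * gw
            b -= lr * gb
    return w, b

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)
w, b = linear_svm_subgradient(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```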
L1 regularization
– An intuitive interpretation
L1 Regularization – An Intuitive Interpretation
● Now we know we prefer sparse parameters
○ ↔ small L-0 norm
● ……but why do people say that minimizing the L1 norm introduces sparsity?
● “For most large underdetermined systems of linear equations, the minimal L1-norm solution is also the sparsest solution”
○ Donoho, David L., Communications on Pure and Applied Mathematics, 2006.
L1 Regularization – An Intuitive Interpretation
● An intuitive interpretation: the L-p norm ≣ controls our preference over parameters
○ L-2 norm: the equal-preference lines in the parameter space are circles centered at the origin
○ L-1 norm: the equal-preference lines are diamonds whose tips lie on the coordinate axes
○ (Figure: equal-preferable lines in the <Parameter Space>)
L1 Regularization – An Intuitive Interpretation
● Intuition: with L1 regularization, it is more likely that the minimal training error is attained at a tip point of the parameter-preference lines, where some parameters are exactly zero
○ Assume the equal-training-error lines are concentric circles; the optimal solution is where the smallest reachable circle touches an equal-preference line
○ The minimum then occurs at a tip point iff the center of the equal-training-error circles lies in the shaded areas shown in the figure, which is relatively probable
○ (Figures: equal training error lines, equal-preference diamond, and the optimal solution at a tip; a numerical illustration follows below)
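A tiny numerical illustration of why L1 zeroes coefficients while L2 only shrinks them: with a squared-error loss and an orthonormal design, ridge divides each ordinary-least-squares coefficient by (1 + λ), while lasso soft-thresholds it (the values of w_ols and lam below are made up):

```python
import numpy as np

w_ols = np.array([3.0, 0.5, -0.2, 1.5, -0.05])   # hypothetical unregularized solution
lam = 0.6                                        # illustrative regularization weight

w_l2 = w_ols / (1.0 + lam)                                    # ridge: uniform shrinkage
w_l1 = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)  # lasso: soft-thresholding

print("L2-regularized:", w_l2)   # every entry shrunk, none exactly zero
print("L1-regularized:", w_l1)   # entries with |w| <= lam become exactly zero
```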
Model parameter prior
– Bayesian viewpoint
Regularization – Bayesian viewpoint
● Bayesian: model parameters are probabilistic.
● Frequentist: model parameters are deterministic.
(Diagram, two views. Frequentist: a fixed yet unknown universe; sampling produces random observations, from which we estimate parameters. Bayesian: given the observations, we estimate parameters, assuming the universe is a certain type of model.)
Regularization – Bayesian viewpoint
● To conclude:
            | Data     | Model parameter
Bayesian    | Fixed    | Variable
Frequentist | Variable | Fixed yet unknown
Regularization – Bayesian viewpoint
● E.g. L-2 regularization
● Assume the parameters w come from a Gaussian distribution with zero mean and identity covariance:
○ (Figure: the equal-probability lines in the <Parameter Probability Space> are circles centered at the origin; the MAP derivation is sketched below)
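A sketch of the standard MAP derivation these slides refer to, written with a general prior variance σ² (the slide's identity covariance corresponds to σ² = 1):

```latex
% Posterior maximization (MAP) with a zero-mean Gaussian prior on w:
\hat{\mathbf{w}}_{\mathrm{MAP}}
  = \arg\max_{\mathbf{w}} \; p(\mathbf{w} \mid D)
  = \arg\max_{\mathbf{w}} \; p(D \mid \mathbf{w})\, p(\mathbf{w}),
  \qquad p(\mathbf{w}) = \mathcal{N}(\mathbf{w};\, \mathbf{0},\, \sigma^2 I)

% Taking negative logs turns the prior into an L-2 penalty:
\hat{\mathbf{w}}_{\mathrm{MAP}}
  = \arg\min_{\mathbf{w}} \; \underbrace{-\log p(D \mid \mathbf{w})}_{\text{training error}}
    \;+\; \underbrace{\tfrac{1}{2\sigma^2}\, \|\mathbf{w}\|_2^2}_{\text{L-2 regularization}}
```

So a zero-mean Gaussian prior on w is equivalent to L-2 regularization, with trade-off weight 1/(2σ²): a tighter prior (smaller σ²) means stronger regularization.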
Early stopping
– Also a regularization
Early Stopping
● Early stopping: stop training before reaching the optimum of the training objective
● Often used in MLP training (a minimal sketch of the loop follows this list)
● An intuitive interpretation:
○ Training iteration ↑
○ → number of weight updates ↑
○ → number of active (far from 0) weights ↑
○ → complexity ↑
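A minimal sketch of such a training loop (the names grad_fn, loss_fn, patience and the default values are illustrative; the slides do not specify an implementation):

```python
import numpy as np

def train_with_early_stopping(w0, grad_fn, loss_fn, train, valid,
                              lr=0.01, max_iters=1000, patience=10):
    """Gradient descent on the training set, but keep the weights that were best
    on a held-out validation set and stop once validation loss has not improved
    for `patience` consecutive iterations."""
    w = np.array(w0, dtype=float)
    best_w, best_loss, since_best = w.copy(), np.inf, 0
    for t in range(max_iters):
        w -= lr * grad_fn(w, train)          # one update on the training objective
        val_loss = loss_fn(w, valid)         # monitor a proxy for generalization
        if val_loss < best_loss:
            best_w, best_loss, since_best = w.copy(), val_loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                        # stop before the training optimum
    return best_w
```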
Early Stopping
● Theoretical proof:
○ Consider a perceptron trained with the hinge loss
○ Assume the optimal separating hyperplane is w*, with maximal margin ρ*
○ Denote the weight at the t-th iteration as wt, with margin ρt
Early Stopping
● (Derivation, equations omitted: two numbered inequalities, stated in terms of the learning rate, the number of weight updates, and R, the radius of the data distribution, together lower-bound the margin ρt of the early-stopped perceptron.)
Early Stopping
● Small learning rate → Large margin
● Small number of updates → Large margin
→ Early Stopping!!!
Early Stopping
(Figures: behavior of the learned model as training iteration ↑)
Conclusion
Conclusion
● Regularization: Techniques to prevent overfitting
○ L1-norm: Sparsity of parameters
○ L2-norm: Large Margin
○ Early stopping
○ ……etc.
● The philosophy of regularization
○ Occam’s razor: “Entities must not be multiplied beyond necessity.”
Reference
● Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, Learning From Data: A Short Course.
● Ronan Collobert, Samy Bengio, “Links Between Perceptrons, MLPs and SVMs”, Proc. ICML, 2004 (ACM).
