LESLIE SMITH’S PAPERS
FOR DL JOURNAL CLUB
DISCIPLINED APPROACH PAPER
• A disciplined approach to neural network hyperparameters: Part 1 – Learning Rate, Batch Size,
Momentum, and Weight Decay
• There is no Part 2
• https://arxiv.org/abs/1803.09820
• Collection of empirical observations spread out through the paper
CONVERGENCE / TEST-VAL LOSS
• Observe box in top-left corner of Figure 1(a)
• Shows training loss (red) decreasing and validation loss
(blue) decreasing then increasing.
• Region of the plot left of the validation loss minimum
indicates underfitting.
• Region right of the validation loss minimum indicates
overfitting.
• Reaching the flat part of the test/validation loss curve
(its minimum) is the goal of hyperparameter tuning.
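
As a rough illustration of reading such a curve, the sketch below labels each epoch relative to the validation loss minimum. The function name `fit_regions` and the loss values are made up for this example; they are not taken from the paper.

```python
def fit_regions(val_losses):
    """Label each epoch relative to the epoch with the lowest validation loss."""
    # Index of the smallest validation loss (first one if there are ties).
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    labels = []
    for epoch in range(len(val_losses)):
        if epoch < best:
            labels.append("underfitting")   # left of the minimum
        elif epoch == best:
            labels.append("minimum")
        else:
            labels.append("overfitting")    # right of the minimum
    return best, labels


# Hypothetical validation losses: decreasing, then increasing.
val_losses = [1.00, 0.70, 0.52, 0.45, 0.47, 0.55, 0.68]
best_epoch, labels = fit_regions(val_losses)
print(best_epoch, labels)
```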
UNDERFITTING
• Underfitting is indicated by continuously decreasing
test loss rather than horizontal plateau (Fig 3(a)).
• Steepness of test loss curve indicates how well the
model is learning (Fig 3(b)).
OVERFITTING
• Increasing Learning Rate moves the model from underfitting
to overfitting.
• Blue curve (Fig 4a) shows steepest fall – indication that this
will produce better final accuracy.
• Yellow curve (Fig 4a) shows overfitting with LR > 0.006.
• More overfitting examples – blue curves in bottom figs.
• Blue curve (Fig 4b) shows underfitting.
• Red curve (Fig 4b) shows overfitting.
CYCLIC LEARNING RATE (CLR)
• Motivation: underfitting if LR is too low, overfitting if too high; finding a good fixed LR requires grid search
• CLR
  • Specify upper and lower bounds for LR
  • Specify step size == number of iterations or epochs used for each step
  • Cycle consists of 2 steps: in the first step LR increases linearly from min to max, in the second step LR
  decreases linearly from max to min.
• Other variants tried but no significant benefit observed.
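
The two-step cycle described above can be sketched as a plain function of the iteration number. This is a minimal illustration, not the paper's code; the name `triangular_lr` and the example bounds are assumptions.

```python
def triangular_lr(iteration, min_lr, max_lr, step_size):
    """Triangular CLR: linear ramp min -> max over one step, then max -> min."""
    cycle_pos = iteration % (2 * step_size)   # position inside the current cycle
    if cycle_pos <= step_size:
        frac = cycle_pos / step_size                      # first step: rising
    else:
        frac = (2 * step_size - cycle_pos) / step_size    # second step: falling
    return min_lr + (max_lr - min_lr) * frac


# Hypothetical bounds and step size, chosen only for illustration.
for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it, min_lr=0.001, max_lr=0.006, step_size=100))
```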
CLR – CHOOSE MAX AND MIN LR
• LR upper bound == smallest LR that causes test/validation loss to increase (and accuracy to
decrease)
• LR lower bound, one of:
  • Factor of 3 or 4 less than upper bound.
  • Factor of 10 or 20 less than upper bound if only 1 cycle is used.
  • Find experimentally using a short test of ~1000 iterations; pick the largest that allows convergence.
• Step size – if LR is too high and training becomes unstable, increase step size to widen the difference
between max and min LR bounds.
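
The rule-of-thumb factors for the lower bound are simple arithmetic, sketched below. The helper name `lower_bound_candidates` and the example upper bound of 6e-3 are illustrative assumptions.

```python
def lower_bound_candidates(max_lr, single_cycle=False):
    """Candidate lower LR bounds derived from the upper bound.

    Uses a factor of 3 or 4 normally, or 10 or 20 when only one cycle is used.
    """
    factors = (10, 20) if single_cycle else (3, 4)
    return [max_lr / f for f in factors]


print(lower_bound_candidates(6e-3))                     # several cycles
print(lower_bound_candidates(6e-3, single_cycle=True))  # one cycle only
```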
SUPER CONVERGENCE
• Super convergence – some networks remain stable under
high LR, so can be trained very quickly with CLR with high
upper bound.
• Fig 5a shows super convergence (orange curve) training
faster to higher accuracy using large LR than blue curve.
• 1-cycle policy – one cycle that is shorter than the total
number of iterations/epochs, then remaining iterations with
LR lowered by several orders of magnitude.
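
One possible reading of the 1-cycle policy is sketched below. The linear decay in the final phase, the function name `one_cycle_lr`, and all numeric values are assumptions made for this example; the slide only states that the remaining iterations use a much lower LR.

```python
def one_cycle_lr(iteration, total_iters, min_lr, max_lr, step_size, final_lr):
    """1-cycle sketch: one triangular cycle, then decay toward final_lr."""
    if 2 * step_size > total_iters:
        raise ValueError("the cycle must fit inside the total number of iterations")
    if iteration <= step_size:
        # First step: linear increase from min_lr to max_lr.
        return min_lr + (max_lr - min_lr) * iteration / step_size
    if iteration <= 2 * step_size:
        # Second step: linear decrease from max_lr back to min_lr.
        return max_lr - (max_lr - min_lr) * (iteration - step_size) / step_size
    # Remaining iterations: linear decay from min_lr to final_lr (an assumption).
    tail = total_iters - 2 * step_size
    frac = (iteration - 2 * step_size) / tail
    return min_lr + (final_lr - min_lr) * frac


# Hypothetical settings: 1000 iterations, cycle of 2 x 400, final LR 100x below min.
for it in (0, 400, 800, 900, 1000):
    print(it, one_cycle_lr(it, 1000, min_lr=0.01, max_lr=0.1,
                           step_size=400, final_lr=1e-4))
```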
REGULARIZATION
• Many forms of regularization
• Large Learning Rate
• Small batch size
• Weight decay (aka L2 regularization)
• Dropout
• Need to balance different regularizers for each dataset and architecture.
• Fig 5b (previous slide) shows tradeoff between weight decay (WD) and LR. Large LR for faster learning
needs to be balanced with lower WD.
• General guidance: reducing other forms of regularization and training with a high LR makes training efficient.
BATCH SIZE
• Larger batch sizes permit larger LR using 1cycle schedule.
• Larger batch size may increase training time, so tradeoff
required.
• Tradeoff – use batch size so number of epochs is optimum
for data/model.
• Batch size limited by GPU memory.
• Fig 6a shows validation accuracy for different batch sizes.
Larger batch sizes better but effect tapers off (BS=1024
blue curve very close to BS=512 red curve).
(CYCLIC) MOMENTUM
• Set momentum as large as possible without causing instability.
• Constant LR => use large constant momentum (0.9 – 0.99)
• Cyclic LR => decrease cyclic momentum as cyclic LR increases
during early to middle part of training (0.95 – 0.85).
• Fig 8a – blue curve is constant momentum, red curve is
decreasing CM and yellow curve is increasing CM (with
increasing CLR).
• These observations also carry over to deep networks (Fig 8b).
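
A minimal sketch of momentum cycled opposite to the LR, using the 0.95 to 0.85 range mentioned above. Raising momentum back to its maximum in the second half of the cycle mirrors the LR schedule and is an assumption of this example, as is the name `cyclical_momentum`.

```python
def cyclical_momentum(iteration, step_size, max_mom=0.95, min_mom=0.85):
    """Momentum that falls while a triangular LR rises, and rises while it falls."""
    cycle_pos = iteration % (2 * step_size)
    if cycle_pos <= step_size:
        frac = cycle_pos / step_size                      # LR rising
    else:
        frac = (2 * step_size - cycle_pos) / step_size    # LR falling
    # frac is 0 at the LR minimum and 1 at the LR maximum.
    return max_mom - (max_mom - min_mom) * frac


for it in (0, 50, 100, 150, 200):
    print(it, cyclical_momentum(it, step_size=100))
```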
WEIGHT DECAY
• Cyclical WD not useful, should remain constant throughout
training.
• Value should be found by grid search (ok with early
termination).
• Fig 9a shows loss plots for different values of WD (with
LR=5e-3, mom=0.95).
• Fig 9b shows equivalent accuracy plots.
CYCLIC LEARNING RATE PAPER
• Cyclical Learning Rates for Training Neural Networks
• https://arxiv.org/abs/1506.01186
• Describes CLR in depth and describes results of training common networks with CLR.
CYCLIC LEARNING RATE
• Successor to:
  • Learning rate schedules – varying LR exponentially over training.
  • Adaptive learning rates (RMSProp, Adam, etc.) – change LR
  based on values of gradients.
• Based on observation that increasing LR has short-term
negative effect but long-term positive effect.
• Let LR vary between range of values.
• Triangular LR (Fig 2) is usually good enough but other variants
also possible.
• Accuracy plot (Fig 1) shows CLR (red curve) is better compared
to Exponential LR.
ESTIMATING CLR PARAMETERS
• Step size
  • Step size = 2 to 10 × number of iterations per epoch
  • Number of training iterations per epoch = number of training records /
  batch size
• Upper and lower bounds for LR
  • Run model for a few epochs while increasing LR between some bounds (1e-4 to 2e-1 for
  example)
  • Upper bound == where accuracy stops increasing, becomes ragged, or
  falls (~ 6e-3).
  • Lower bound, either:
    • 1/3 or 1/4 of upper bound (~ 2e-3)
    • Point at which accuracy starts to increase (~ 1e-3)
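
The step size arithmetic can be worked through as follows. The dataset size and batch size are made-up numbers, and rounding up for a final partial batch is an assumption of this sketch.

```python
import math


def iterations_per_epoch(num_records, batch_size):
    """Training iterations per epoch; rounds up to count a final partial batch."""
    return math.ceil(num_records / batch_size)


def step_size_range(num_records, batch_size, low_mult=2, high_mult=10):
    """Step size range: 2 to 10 times the number of iterations per epoch."""
    per_epoch = iterations_per_epoch(num_records, batch_size)
    return low_mult * per_epoch, high_mult * per_epoch


# Hypothetical dataset: 50,000 training records, batch size 100.
per_epoch = iterations_per_epoch(50_000, 100)
low, high = step_size_range(50_000, 100)
print(per_epoch, low, high)
```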
LR FINDER USAGE
• LR Finder – first available in Fast.AI library.
• Upper bound – between 1e-3 and 1e-2 (10⁻³ and 10⁻²) where loss is
decreasing fastest.
• Can also be found using lr.plot_loss_change() – minimum point (here 1e-2).
• Lower bound is about 1-2 orders of magnitude lower.
• LR Finder (Keras) – https://github.com/surmenok/keras_lr_finder
• LR Finder (Pytorch) -- https://github.com/davidtvs/pytorch-lr-finder
• Keras example -- https://github.com/sujitpal/keras-tutorial-odsc2020/blob/master/02_03_exercise_2_solved.ipynb
• Fast.AI example -- https://colab.research.google.com/github/fastai/fastbook/blob/master/16_accel_sgd.ipynb
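
Independent of any particular library, the idea of picking the LR where the loss falls fastest can be sketched on recorded (LR, loss) pairs. The helper `steepest_descent_lr` and the data below are hypothetical; a real LR finder would record the losses from an actual training sweep and typically smooth them first.

```python
def steepest_descent_lr(lrs, losses):
    """Return the LR at which the largest drop in loss was recorded."""
    # Change in loss between consecutive LR values (negative means improving).
    changes = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    i = min(range(len(changes)), key=changes.__getitem__)
    return lrs[i + 1]


# Made-up sweep: LR grows by 10x per point, loss falls and then blows up.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.25, 2.00, 1.20, 1.50, 4.00]

upper = steepest_descent_lr(lrs, losses)
# Lower bound about 1 to 2 orders of magnitude below the upper bound.
lower_candidates = (upper / 10, upper / 100)
print(upper, lower_candidates)
```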

Disciplined approach to neural network hyperparameters

  • 1. LESLIE SMITH’S PAPERS FOR DL JOURNAL CLUB
  • 2. DISCIPLINED APPROACH PAPER • A disciplined approach to neural network hyperparameters: Part 1 – Learning Rate, Batch Size, Momentum, and Weight Decay • There is no Part 2 • https://arxiv.org/abs/1803.09820 • Collection of empirical observations spread out through the paper
  • 3. CONVERGENCE / TEST-VAL LOSS • Observe box in top-left corner of Figure 1(a) • Shows training loss (red) decreasing and validation loss (blue) decreasing then increasing. • Plot to left of validation loss minimum indicates underfitting • Plot to right of validation loss minimum indicates overfitting. • Achieving the horizontal part of test/validation loss (the minimum) is the goal of hyperparameter tuning.
  • 4. UNDERFITTING • Underfitting is indicated by continuously decreasing test loss rather than horizontal plateau (Fig 3(a)). • Steepness of test loss curve indicates how well the model is learning (Fig 3(b)).
  • 5. OVERFITTING • Increasing Learning Rate moves the model from underfitting to overfitting. • Blue curve (Fig 4a) shows steepest fall – indication that this will produce better final accuracy. • Yellow curve (Fig 4a) shows overfitting with LR > 0.006. • More overfitting examples – blue curves in bottom figs. • Blue curve (Fig 4b) shows underfitting. • Red curve (Fig 4b) shows overfitting.
  • 6. CYCLIC LEARNING RATE (CLR) • Motivation: Underfitting if LR too low, overfitting if too high; requires grid search • CLR • Specify upper and lower bound for LR • Specify step size == number of iterations or epochs used for each step • Cycle consists of 2 steps – first step LR increases linearly from min to max, second step LR decreases linearly from max to min. • Other variants tried but no significant benefit observed.
  • 7. CLR – CHOOSE MAX AND MIN LR • LR upper bound == min value of LR that causes test / validation loss to increase (and accuracy to decrease) • LR lower bound, one of: • Factor of 3 or 4 less than upper bound. • Factor of 10 or 20 less than upper bound if only 1 cycle is used. • Find experimentally using short test of ~1000 iterations, pick largest that allows convergence. • Step size – if LR too high, training becomes unstable, increase step size to increase difference between max and min LR bounds.
  • 8. SUPER CONVERGENCE • Super convergence – some networks remain stable under high LR, so can be trained very quickly with CLR with high upper bound. • Fig 5a shows super convergence (orange curve) training faster to higher accuracy using large LR than blue curve. • 1-cycle policy – one cycle that is shorter than the total number of iterations/epochs, then remaining iterations with LR lowered by several orders of magnitude.
  • 9. REGULARIZATION • Many forms of regularization • Large Learning Rate • Small batch size • Weight decay (aka L2 regularization) • Dropout • Need to balance different regularizers for each dataset and architecture. • Fig 5b (previous slide) shows tradeoff between weight decay (WD) and LR. Large LR for faster learning needs to be balanced with lower WD. • General guidance: reducing other forms of regularization and training with a high LR makes training efficient.
  • 10. BATCH SIZE • Larger batch sizes permit larger LR using 1cycle schedule. • Larger batch size may increase training time, so tradeoff required. • Tradeoff – use batch size so number of epochs is optimum for data/model. • Batch size limited by GPU memory. • Fig 6a shows validation accuracy for different batch sizes. Larger batch sizes better but effect tapers off (BS=1024 blue curve very close to BS=512 red curve).
  • 11. (CYCLIC) MOMENTUM • Set momentum as large as possible without causing instability. • Constant LR => use large constant momentum (0.9 – 0.99) • Cyclic LR => decrease cyclic momentum as cyclic LR increases during early to middle part of training (0.95 – 0.85). • Fig 8a – blue curve is constant momentum, red curve is decreasing CM and yellow curve is increasing CM (with increasing CLR). • These observations also carry over to deep networks (Fig 8b).
  • 12. WEIGHT DECAY • Cyclical WD not useful, should remain constant throughout training. • Value should be found by grid search (ok with early termination). • Fig 9a shows loss plots for different values of WD (with LR=5e-3, mom=0.95). • Fig 9b shows equivalent accuracy plots.
  • 13. CYCLIC LEARNING RATE PAPER • Cyclical Learning Rates for Training Neural Networks • https://arxiv.org/abs/1506.01186 • Describes CLR in depth and describes results of training common networks with CLR.
  • 14. CYCLIC LEARNING RATE • Successor to • Learning rate schedules – varying LR exponentially over training. • Adaptive Learning Rates (RMSProp, ADAM, etc) – change LR based on values of gradients. • Based on observation that increasing LR has short-term negative effect but long-term positive effect. • Let LR vary between range of values. • Triangular LR (Fig 2) is usually good enough but other variants also possible. • Accuracy plot (Fig 1) shows CLR (red curve) is better compared to Exponential LR.
  • 15. ESTIMATING CLR PARAMETERS • Step size • Step size = 2 to 10 times the number of iterations per epoch • Number of training iterations per epoch = number of training records / batch size • Upper and lower bounds for LR • Run model for few epochs with some bounds (1e-4 to 2e-1 for example) • Upper bound == where accuracy stops increasing, becomes ragged, or falls (~ 6e-3). • Lower bound • Either 1/3 or 1/4 of upper bound (~ 2e-3) • Point at which accuracy starts to increase (~ 1e-3)
  • 16. LR FINDER USAGE • LR Finder – first available in Fast.AI library. • Upper bound – between 1e-3 and 1e-2 (10^-3 and 10^-2) where loss is decreasing fastest. • Can also be found using lr.plot_loss_change() – minimum point (here 1e-2). • Lower bound is about 1-2 orders of magnitude lower. • LR Finder (Keras) – https://github.com/surmenok/keras_lr_finder • LR Finder (PyTorch) – https://github.com/davidtvs/pytorch-lr-finder • Keras example – https://github.com/sujitpal/keras-tutorial-odsc2020/blob/master/02_03_exercise_2_solved.ipynb • Fast.AI example – https://colab.research.google.com/github/fastai/fastbook/blob/master/16_accel_sgd.ipynb
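
Slides 6 and 15 describe the triangular cycle and how to size its steps. The Python sketch below is one way to write that down, not the paper's reference code. The function names, the rounding up of a partial final batch, and the example numbers (50,000 records, batch size 100, step size of 4 epochs, bounds 2e-3 and 6e-3) are assumptions chosen for illustration.

```python
import math


def iterations_per_epoch(num_records, batch_size):
    # Slide 15: iterations per epoch = training records / batch size.
    # Rounding up for a partial final batch is an assumption of this sketch.
    return math.ceil(num_records / batch_size)


def triangular_lr(iteration, step_size, base_lr, max_lr):
    """Learning rate at a 0-based iteration under the triangular policy."""
    position = iteration % (2 * step_size)   # position inside the current cycle
    if position <= step_size:
        fraction = position / step_size      # first step: rise from base_lr to max_lr
    else:
        fraction = 2 - position / step_size  # second step: fall back to base_lr
    return base_lr + (max_lr - base_lr) * fraction


# Illustrative values only: 50,000 records, batch size 100, step size = 4 epochs.
step_size = 4 * iterations_per_epoch(50_000, 100)  # 4 * 500 = 2000 iterations
for it in (0, 1000, 2000, 3000, 4000):
    print(it, round(triangular_lr(it, step_size, base_lr=2e-3, max_lr=6e-3), 6))
```

One full cycle spans 2 * step_size iterations, so with these numbers the LR returns to its lower bound every 8 epochs.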
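
The bound-selection rules on slides 7 and 15 can be illustrated with a small helper. This is a rough sketch under strong assumptions: the validation losses are taken to be already smoothed, the upper bound is read as the first LR at which the loss rises compared with the previous LR, and the lower bound is the upper bound divided by 3 (slide 7 also allows 4, or 10 to 20 for a single cycle). The function name and the loss values are invented for illustration and are not results from the paper.

```python
def pick_clr_bounds(lrs, val_losses, lower_factor=3.0):
    """Return (lower_bound, upper_bound) from a short LR range test.

    lrs must be sorted in increasing order; val_losses[i] is the
    (smoothed) validation loss observed while training at lrs[i].
    """
    for i in range(1, len(lrs)):
        if val_losses[i] > val_losses[i - 1]:
            upper = lrs[i]                      # smallest LR where loss goes up
            return upper / lower_factor, upper  # lower bound = upper / 3 by default
    raise ValueError("loss never increased; extend the LR range and rerun")


# Made-up range-test readings, for illustration only.
lrs = [1e-4, 1e-3, 3e-3, 6e-3, 1e-2]
val_losses = [2.3, 1.9, 1.5, 1.6, 2.5]
lower, upper = pick_clr_bounds(lrs, val_losses)
print(f"lower={lower:.4f} upper={upper:.4f}")
```

Real loss curves are noisy, so in practice the bounds are usually read off a plot such as the one an LR finder produces, rather than taken from the first upward tick.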
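
The 1-cycle policy (slide 8) and cyclic momentum (slide 11) can be combined into one schedule. The following is a minimal sketch, not the paper's implementation: the function name, the linear shape of the final decay, holding momentum at its maximum during that decay, and the default final_lr_scale of 1e-2 are all assumptions. The momentum range 0.95 to 0.85 is the one given on slide 11.

```python
def one_cycle(iteration, total_iters, cycle_iters, lr_min, lr_max,
              mom_max=0.95, mom_min=0.85, final_lr_scale=1e-2):
    """Return (lr, momentum) for a 0-based iteration.

    The cycle occupies the first cycle_iters iterations; the remainder
    decays the LR linearly from lr_min to lr_min * final_lr_scale.
    """
    if not 0 < cycle_iters < total_iters:
        raise ValueError("cycle must be shorter than the whole run")
    if not 0 <= iteration < total_iters:
        raise ValueError("iteration out of range")

    if iteration < cycle_iters:
        half = cycle_iters / 2
        # 0 at both ends of the cycle, 1 at its midpoint.
        fraction = 1 - abs(iteration - half) / half
        lr = lr_min + (lr_max - lr_min) * fraction
        # Momentum moves opposite to the LR: it falls while the LR rises.
        momentum = mom_max - (mom_max - mom_min) * fraction
    else:
        tail = (iteration - cycle_iters) / (total_iters - cycle_iters)
        lr = lr_min * (1 - tail * (1 - final_lr_scale))
        momentum = mom_max
    return lr, momentum


# Illustrative run: 1000 iterations in total, of which the cycle takes 800.
for it in (0, 400, 800, 999):
    lr, mom = one_cycle(it, total_iters=1000, cycle_iters=800,
                        lr_min=0.01, lr_max=0.1)
    print(it, round(lr, 5), round(mom, 3))
```

Inside the cycle the LR peaks exactly where momentum bottoms out, which matches the "decrease cyclic momentum as cyclic LR increases" guidance on slide 11.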