
Optimization as a Model for Few-Shot Learning - ICLR 2017 reading seminar


These slides were presented at the ICLR 2017 reading seminar @ Shibuya Hikarie, Tokyo, Japan.


  1. 2016/06/17 @ DeNA, Shibuya Hikarie. Hokuto Kagaya (@_hokkun_). Optimization as a Model for Few-Shot Learning
  2. TL;DR • Purpose • Better inference for few-shot/one-shot learning problems • Method • LSTM-based meta-learning for deep neural networks • Result • Competitive with deep metric-learning techniques
  3. Background (1) • Why has deep learning succeeded? • compute power • amount of data • Large datasets • ImageNet (images) • Microsoft COCO Captions (images & captions) • YouTube 8M (video) • WikiText (text)
  4. Background (2) • However, in many fields, collecting a large number of training samples is: • difficult • e.g. fine-grained recognition (cars, birds, food, ...) • time-consuming • scraping, crawling, annotating, ... • In contrast, human beings can generalize from only a few samples.
  5. Problem & Purpose (1) • How can we acquire a generalized model using few samples and a set number of updates? • Existing gradient-based training algorithms (SGD, Adam, AdaGrad, ...) do not fit the setting of a fixed number of parameter updates. • Put simply, the authors want to find good initial parameters for the NN. • cf. review comments: it would be even better to be able to find architectural parameters of the NN.
  6. Problem & Purpose (2) • How? • Meta-learning • Learning to learn: train the learner itself. • A variety of meta-learning approaches • Transfer learning • use experience from a different domain • popular in image classification, especially fine-grained visual classification • Ensemble classifiers • combine multiple classifiers - This article is very good for understanding meta-learning - http://www.scholarpedia.org/article/Metalearning
  7. Proposed method • LSTM-based meta-learning
  8. * Prerequisites • What is an LSTM? • Long Short-Term Memory • We want to handle sequential data, but the error gradient explodes or vanishes • Fix the weight on past data to 1 so it is not forgotten, and perform input/output selectively ('97) • However, this could not cope with abrupt changes in context, so a forget gate was added so that memories of past data can be selectively erased ('99) • Reference (in Japanese) • http://qiita.com/t_Signull/items/21b82be280b46f467d1b
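The gate mechanics described above can be sketched as a single LSTM cell step. This is a minimal numpy illustration of a standard LSTM cell, not the paper's implementation; the weight layout (four stacked blocks) is a common convention I am assuming here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W has shape (4*H, X+H) and b has shape (4*H,); the four row blocks
    are the input-gate, forget-gate, output-gate, and candidate transforms.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])       # input gate: how much new content to write
    f = sigmoid(z[H:2*H])     # forget gate: how much old state to keep
    o = sigmoid(z[2*H:3*H])   # output gate
    g = np.tanh(z[3*H:4*H])   # candidate cell content
    c = f * c_prev + i * g    # gated state update
    h = o * np.tanh(c)
    return h, c
```

The forget gate `f` is exactly the '99 addition the slide mentions: with `f = 1` the cell never forgets, as in the original '97 formulation.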
  9. * Data separation • meta-train dataset • meta-test dataset • meta sample • target training samples + target testing samples together form one meta sample (a.k.a. episode)
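The episode construction above can be sketched in a few lines. This is a hypothetical sampler assuming an N-way, k-shot setup; the function and parameter names are my own, not from the paper's code.

```python
import random

def make_episode(dataset, n_way=5, k_shot=1, n_test=5, rng=None):
    """Build one meta sample (episode) from a class -> examples dict.

    Returns (train_set, test_set), each a list of (example, label) pairs:
    the target training samples and target testing samples of one episode.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(dataset), n_way)          # pick N classes
    train_set, test_set = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(dataset[cls], k_shot + n_test)
        train_set += [(x, label) for x in examples[:k_shot]]   # k per class
        test_set += [(x, label) for x in examples[k_shot:]]    # held out
    return train_set, test_set
```

Episodes drawn from the meta-train dataset train the meta-learner; episodes from the meta-test dataset evaluate it on unseen classes.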
  10. Proposed Method (2) • Normal SGD: θ_t = θ_{t−1} − α_t ∇_{θ_{t−1}} L_t • LSTM cell update: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t • where i_t = σ(W_I · [∇_{θ_{t−1}} L_t, L_t, θ_{t−1}, i_{t−1}] + b_I) and f_t = σ(W_F · [∇_{θ_{t−1}} L_t, L_t, θ_{t−1}, f_{t−1}] + b_F) • i.e. each gate sees the current gradient, the current loss, the previous θ, and its own previous value • Metaphor: c_t ↔ θ_t with c̃_t = −∇_{θ_{t−1}} L_t • f_t is not a constant 1, so the learner can escape from bad local optima
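The gated update above can be made concrete with a small numpy sketch, applied coordinate-wise (shared gate weights across all parameter coordinates). All names here are illustrative, not the paper's code; setting both gates to their saturated values recovers plain SGD with a unit learning rate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_update(theta_prev, grad, loss, i_prev, f_prev, W_i, b_i, W_f, b_f):
    """One gated parameter update in the style of the slide's LSTM metaphor.

    Each gate is computed from [gradient, loss, previous theta, previous
    gate value]; W_i and W_f have shape (4,) and are shared across the
    D parameter coordinates. With f_t = 1 and i_t = alpha this is SGD.
    """
    feats_i = np.stack([grad, np.full_like(grad, loss), theta_prev, i_prev])
    feats_f = np.stack([grad, np.full_like(grad, loss), theta_prev, f_prev])
    i_t = sigmoid(W_i @ feats_i + b_i)   # learning-rate-like input gate
    f_t = sigmoid(W_f @ feats_f + b_f)   # forget gate, not pinned to 1
    theta_t = f_t * theta_prev + i_t * (-grad)   # c_t = f⊙c_{t-1} + i⊙c̃_t
    return theta_t, i_t, f_t
```

Meta-training then optimizes `W_i`, `b_i`, `W_f`, `b_f` (and the initial θ) through the losses observed across episodes.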
  11. Proposed Method (3) • (outer loop) meta-learner's iteration • (inner loop) learner's iteration • The (meta) loss value is computed from the final state of the LSTM (= the parameters of the target model) and D_test's data and labels.
  12. Proposed Method (4) • From author's slide
  13. Proposed Method (5) • What is improved gradually? • First: the LSTM parameters (a.k.a. meta-learner parameters) • that is, "how should we update target models?" • Second: the LSTM states (outputs?) • The final θ_T is shared across batches, so learning proceeds rapidly thanks to the good initialization
  14. Other Topics • Coordinate-wise LSTM • Preprocessing of LSTM inputs • for both topics, see [Andrychowicz, NIPS 2016] (preprocessing is in the appendix) • adjust the scaling of gradients and losses • separate magnitude and sign information • Batch normalization • avoid "dataset" (episode) level leakage of information • Related work: metric learning • e.g. Siamese networks
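The magnitude/sign preprocessing referenced above (from the appendix of Andrychowicz et al., NIPS 2016) can be sketched as follows; I am assuming the commonly cited form with threshold e^{-p}, p = 10.

```python
import numpy as np

def preprocess(x, p=10.0):
    """Split gradients/losses into (scaled log-magnitude, sign).

    For |x| >= e^{-p}: return (log|x| / p, sign(x)).
    Otherwise:         return (-1, e^p * x), a smooth fallback for tiny values.
    Output has shape (..., 2), so magnitudes of very different scales
    become comparable LSTM inputs.
    """
    large = np.abs(x) >= np.exp(-p)
    mag = np.where(large, np.log(np.abs(x) + 1e-30) / p, -1.0)
    sgn = np.where(large, np.sign(x), np.exp(p) * x)
    return np.stack([mag, sgn], axis=-1)
```

Without this step, gradients spanning many orders of magnitude would dominate or vanish in the LSTM's linear transforms.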
  15. Evaluation Method • Baseline 1: nearest neighbor • meta-train: train a neural network using all samples • meta-test: compare the network outputs for the training samples with those for the testing samples • Baseline 2: fine-tune • meta-train: in addition to 1, use the meta-validation dataset for hyperparameter search and fine-tune the network from 1 • Baseline 3: Matching Networks • the SOTA of metric learning
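Baseline 1 amounts to nearest-neighbor matching in the trained network's feature space. A minimal sketch, assuming cosine similarity on precomputed features (the similarity choice and names are mine, not specified on the slide):

```python
import numpy as np

def nn_classify(train_feats, train_labels, test_feats):
    """Nearest-neighbor baseline over network features.

    Each test point gets the label of its most cosine-similar training
    example; feature extraction is assumed to happen upstream.
    """
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = b @ a.T                        # (n_test, n_train) similarities
    return train_labels[np.argmax(sims, axis=1)]
```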
  16. Evaluation Result
  17. Visualization and Insight • input gates • 1. differ among datasets • = the meta-learner isn't simply learning a fixed optimization strategy • 2. differ among tasks • = the meta-learner uses different strategies to solve each setting • forget gates • simple decay • in the end, almost constant
  18. Visualization and Insight
  19. Conclusion • Proposed an LSTM-based model to learn a learner, inspired by the metaphor between SGD updates and the LSTM cell update. • Trains the meta-learner to discover: • 1. a good initialization of the learner • 2. a good mechanism for updating the learner's parameters • Experimental results are competitive with SOTA metric-learning methods.
  20. Future work • few samples / many classes • more challenging scenarios • from the review comments: it would be even better to be able to find architectural parameters of the NN.
  21. Impressions • Viewing the transfer-learning task of "exploiting experience from another domain" as "learning over a sequence" and training it as an LSTM model seemed natural to me • Has this idea appeared before? I ran out of time and could not read through the related work • As the review comment said, it would be great if architecture optimization were also possible • though there is also the argument that stacking many simple filters works well • It brought back the hardship of trying many hyperparameters with cuda-convnet in my undergraduate days
  22. Things I'm probably not getting • In the end, what is genuinely new in this paper? Using an LSTM for learning-to-learn is probably not a first • e.g., Andrychowicz+'16 trained an LSTM that takes gradients as input and outputs the parameter updates of the target learner • Is the novelty that the parameters themselves are output directly?
