
Deep Learning in Computer Vision


Deep Learning in Computer Vision Applications
1. Basics on Convolutional Neural Network
2. Optimization Methods (Momentum, AdaGrad, RMSProp, Adam, etc.)
3. Semantic Segmentation
4. Class Activation Map
5. Object Detection
6. Recurrent Neural Network
7. Visual Question Answering (VQA)
8. Word2Vec (Word embedding)
9. Image Captioning


Deep Learning in Computer Vision

  1. 1. Introduction to Deep Learning Presenter: Sungjoon Choi (sungjoon.choi@cpslab.snu.ac.kr)
  2. 2. Contents: Optimization methods, CNN basics, Semantic segmentation, Weakly supervised localization, Image detection, RNN, Visual QnA, Word2Vec, Image Captioning
  3. 3. What is deep learning? 3 Wikipedia says: “Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.” Keywords: machine learning, high-level abstraction, network
  4. 4. Is it brand new? 4 Neural Nets McCulloch & Pitts 1943 Perceptron Rosenblatt 1958 RNN Grossberg 1973 CNN Fukushima 1979 RBM Hinton 1999 DBN Hinton 2006 D-AE Vincent 2008 AlexNet Alex 2012 GoogLeNet Szegedy 2015
  5. 5. Deep architectures 5 Feed-Forward: multilayer neural nets, convolutional nets Feed-Back: Stacked Sparse Coding, Deconvolutional Nets Bi-Directional: Deep Boltzmann Machines, Stacked Auto-Encoders Recurrent: Recurrent Nets, Long Short-Term Memory
  6. 6. CNN basics
  7. 7. CNN 7 CNNs are basically layers of convolutions followed by subsampling and fully connected layers. Intuitively speaking, the convolution and subsampling layers work as feature extractors, while the fully connected layers classify which category the current input belongs to using the extracted features.
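For illustration, here is a minimal PyTorch sketch (not from the slides) of the conv → subsample → fully connected pattern described above; the layer sizes and the 28x28 input are assumptions, not the presenter's architecture.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, padding=2),  # convolution: feature extraction
    nn.ReLU(),
    nn.MaxPool2d(2),                             # subsampling
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # fully connected: classification
)

logits = cnn(torch.randn(1, 1, 28, 28))          # e.g., one 28x28 grayscale image
print(logits.shape)                              # torch.Size([1, 10])
```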
  8.-16. (figure-only slides)
  17. 17. Optimization methods
  18. 18. Gradient descent?
  19. 19. Gradient descent? There are three variants of gradient descent. They differ in how much data we use to compute the gradient, trading off accuracy against computing time.
  20. 20. Batch gradient descent In batch gradient descent, we use the entire training dataset to compute the gradient.
  21. 21. Stochastic gradient descent In stochastic gradient descent (SGD), the gradient is computed from each training sample, one by one.
  22. 22. Mini-batch gradient descent In mini-batch gradient descent, we take the best of both worlds. Common mini-batch sizes range between 50 and 256 (but can vary).
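For illustration, a small numpy sketch (not from the slides) contrasting the three variants on a toy least-squares problem; the data, learning rate, and batch size of 64 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy data
theta, eta = np.zeros(5), 0.1

def grad(theta, Xb, yb):
    # gradient of 0.5 * mean((Xb @ theta - yb)^2)
    return Xb.T @ (Xb @ theta - yb) / len(yb)

# batch gradient descent: the whole dataset per update
theta -= eta * grad(theta, X, y)

# stochastic gradient descent: one training sample per update
i = rng.integers(len(y))
theta -= eta * grad(theta, X[i:i+1], y[i:i+1])

# mini-batch gradient descent: e.g., 64 samples per update
idx = rng.choice(len(y), size=64, replace=False)
theta -= eta * grad(theta, X[idx], y[idx])
```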
  23. 23. Challenges Choosing a proper learning rate is cumbersome → learning rate schedules. Avoiding getting trapped in suboptimal local minima.
  24. 24. Momentum
  25. 25. Nesterov accelerated gradient
  26. 26. Adagrad Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters: $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon}\, g_{t,i}$
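A minimal numpy sketch of the Adagrad update above; the parameter size, learning rate, and the fake gradient are illustrative assumptions.

```python
import numpy as np

theta = np.zeros(5)
G = np.zeros(5)          # running sum of squared gradients (diagonal of G_t), per parameter
eta, eps = 0.01, 1e-8

def adagrad_step(theta, G, g):
    G += g * g                                   # accumulate g_{t,i}^2
    theta -= eta / (np.sqrt(G) + eps) * g        # per-parameter learning rate
    return theta, G

g = np.array([0.1, 0.0, 0.5, 0.0, 0.2])          # a fake gradient
theta, G = adagrad_step(theta, G, g)
```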
  27. 27. Adadelta Adadelta is an extension of Adagrad that seeks to reduce its monotonically decreasing learning rate. It restricts the window of accumulated past gradients to some fixed size $w$: $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2$, $E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma)\, \Delta\theta_t^2$, $\theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$. No learning rate needed!
  28. 28. Exponential moving average 28
  29. 29. RMSprop RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture: $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$
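The same update as a short numpy sketch; the decay 0.9 and learning rate 0.001 are commonly used values, and the fake gradient is an assumption.

```python
import numpy as np

theta, Eg2 = np.zeros(5), np.zeros(5)
eta, gamma, eps = 0.001, 0.9, 1e-8

def rmsprop_step(theta, Eg2, g):
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g          # exponential moving average of g^2
    theta = theta - eta / (np.sqrt(Eg2) + eps) * g   # divide the step by its RMS
    return theta, Eg2

theta, Eg2 = rmsprop_step(theta, Eg2, np.array([0.1, 0.0, 0.5, 0.0, 0.2]))
```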
  30. 30. Adam Adaptive Moment Estimation (Adam) stores both an exponentially decaying average of past gradients and of past squared gradients: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$ (momentum), $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ (running average of squared gradients), $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}\, m_t$
  31. 31. Adam (same update as above, shown without the annotations)
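For completeness, a numpy sketch of the Adam update with the default hyperparameters from Kingma & Ba (2015); the parameter size and the fake gradient are assumptions.

```python
import numpy as np

theta = np.zeros(5)
m, v, t = np.zeros(5), np.zeros(5), 0
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

def adam_step(theta, m, v, t, g):
    t += 1
    m = beta1 * m + (1 - beta1) * g              # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g          # 2nd moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t

theta, m, v, t = adam_step(theta, m, v, t, np.array([0.1, 0.0, 0.5, 0.0, 0.2]))
```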
  32. 32. Visualization
  33. 33. Semantic segmentation
  34. 34. Semantic Segmentation? (figure: image classification, e.g., lion / dog / giraffe labels; object detection, e.g., bicycle / person / ball / dog boxes; semantic segmentation, per-pixel person / bicycle labels)
  35. 35. Semantic segmentation 35
  36.-44. (figure-only slides)
  45. 45. Results 45
  46.-72. (figure-only slides)
  73. 73. Results 73
  74. 74. Results 74
  75. 75. Weakly supervised localization
  76. 76. Weakly supervised localization 76
  77. 77. Weakly supervised localization 77
  78. 78. Weakly Supervised Object Localization 78 Supervised localization is usually trained with bounding-box annotations. What if localization were possible from image-level labels alone, without bounding-box annotations? Today's seminar: Learning Deep Features for Discriminative Localization (arXiv:1512.04150v1), Zhou et al., CVPR 2016
  79. 79. Architecture 79 AlexNet + GAP, trained on Places205: 227x227x3 input → 11x11x512 convolutional feature maps → 11x11 average pooling, i.e. global average pooling (GAP) → 512-d vector → 205-way output (e.g., "living room")
  80. 80. Class activation map (CAM) 80 • Identify important image regions by projecting the weights of the output layer back onto the convolutional feature maps. • A CAM can be generated for each class in a single image. • The regions for each category differ within a given image (e.g., palace, dome, church ...).
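A minimal numpy sketch (not the authors' code) of how a CAM follows from GAP plus the output-layer weights; all shapes and the random stand-ins for the feature maps are illustrative assumptions.

```python
import numpy as np

C, H, W = 512, 11, 11
feats = np.random.rand(C, H, W)        # last conv feature maps f_k(x, y)
w = np.random.rand(205, C)             # output-layer weights after GAP (205 classes)

gap = feats.mean(axis=(1, 2))          # global average pooling -> 512-d vector
scores = w @ gap                       # class scores S_c = sum_k w_k^c * mean(f_k)
c = scores.argmax()                    # predicted class

# CAM for class c: project w^c back onto the feature maps
cam = np.tensordot(w[c], feats, axes=1)            # M_c(x, y) = sum_k w_k^c f_k(x, y)
cam = (cam - cam.min()) / (cam.max() - cam.min())  # normalize to [0, 1] for display
print(cam.shape)                       # (11, 11); upsample to image size to overlay
```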
  81. 81. Results 81 • CAM on top 5 predictions on an image • CAM for one object class in images
  82. 82. GAP vs. GMP 82 • Oquab et al., CVPR 2015, "Is object localization for free? Weakly-supervised learning with convolutional neural networks" uses global max pooling (GMP). • Intuitive difference between GMP and GAP? • The GAP loss encourages the network to identify the full extent of an object, while the GMP loss encourages it to identify just one discriminative part. • With GAP, the average of a map is maximized by finding all discriminative parts of an object: if some activations are low, the output of that map decreases. • With GMP, low scores for all image regions except the most discriminative part do not impact the score, since only the maximum is kept.
  83. 83. GAP & GMP 83 • GAP (upper) vs. GMP (lower) • GAP outperforms GMP: it highlights more complete object regions and less background noise. • The average-pooling loss benefits when the network identifies all discriminative regions of an object.
  84. (figure-only slide)
  85. 85. Concept localization 85 Concept localization in weakly labeled images • Positive set: short phrases from text captions • Negative set: randomly selected images • The model captures the concept even though phrases are much more abstract than object names. Weakly supervised text detector • Positive set: 350 Google StreetView images that contain text • Negative set: outdoor scene images from the SUN dataset • Text is highlighted without bounding-box annotations.
  86. 86. Image detection
  87.-101. (figure-only slides)
  102. 102. Results 102
  103. 103. SPPnet
  104.-112. (figure-only slides)
  113. 113. Results 113
  114. 114. Results 114
  115. 115. Fast R-CNN
  116.-125. (figure-only slides)
  126. 126. Faster R-CNN
  127.-138. (figure-only slides)
  139. 139. Results 139
  140. 140. Results 140
  141. 141. R-CNN 141 Image → Regions → Resize → Convolutional Features → Classify
  142. 142. SPP net 142 Image → Convolutional Features → SPP over Regions → Classify
  143. 143. R-CNN vs. SPP net 143 R-CNN SPP net
  144. 144. Fast R-CNN 144 Image → Convolutional Features → Regions → RoI Pooling Layer → Class Label + Confidence (per region)
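To make the RoI pooling step concrete, here is a tiny numpy sketch (illustrative only, not the paper's implementation) of RoI max pooling: each region of the shared feature map is divided into a fixed grid and max-pooled, so regions of any size produce a fixed-size feature.

```python
import numpy as np

def roi_max_pool(feat, roi, out_h=7, out_w=7):
    """feat: (C, H, W) conv feature map; roi: (x0, y0, x1, y1) in feature-map coords."""
    x0, y0, x1, y1 = roi
    C = feat.shape[0]
    out = np.zeros((C, out_h, out_w), dtype=feat.dtype)
    ys = np.linspace(y0, y1, out_h + 1).astype(int)   # grid boundaries along y
    xs = np.linspace(x0, x1, out_w + 1).astype(int)   # grid boundaries along x
    for i in range(out_h):
        for j in range(out_w):
            cell = feat[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.reshape(C, -1).max(axis=1)  # max over the cell
    return out

feat = np.random.rand(512, 38, 50)                 # shared conv features (assumed size)
pooled = roi_max_pool(feat, roi=(10, 5, 30, 25))   # one region proposal
print(pooled.shape)                                # (512, 7, 7), fixed size for any RoI
```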
  145. 145. R-CNN vs. SPP net vs. Fast R-CNN 145 R-CNN SPP net Fast R-CNN
  146. 146. Faster R-CNN 146 Image → Fully Convolutional Features → Bounding-Box Regression + Classification → Fast R-CNN
  147. 147. R-CNN vs. SPP net vs. Fast R-CNN 147 R-CNN SPP net Fast R-CNN Faster R-CNN
  148. 148. 148 Results
  149.-152. (figure-only slides)
  153. 153. RNN
  154. 154. Recurrent Neural Network 155 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  155. 155. Recurrent Neural Network 156 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  156. 156. LSTM comes in! 157 Long Short Term Memory This is just a standard RNN. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  157. 157. LSTM comes in! 158 Long Short Term Memory This is just a standard RNN.This is the LSTM! http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  158. 158. Overall Architecture 159 (Cell) state, hidden state, forget gate, input gate, output gate → next (cell) state, next hidden state; input, output (output = hidden state) http://colah.github.io/posts/2015-08-Understanding-LSTMs/
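A minimal numpy sketch of one LSTM step with the gates named on the slide; the dimensions and random weight initialization are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, H = 8, 16                                   # input dim, hidden dim (assumptions)
rng = np.random.default_rng(0)
Wf, Wi, Wo, Wg = (rng.normal(scale=0.1, size=(H, D + H)) for _ in range(4))
bf, bi, bo, bg = (np.zeros(H) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])            # input stacked with previous hidden state
    f = sigmoid(Wf @ z + bf)                   # forget gate
    i = sigmoid(Wi @ z + bi)                   # input gate
    o = sigmoid(Wo @ z + bo)                   # output gate
    g = np.tanh(Wg @ z + bg)                   # candidate cell update
    c = f * c_prev + i * g                     # next (cell) state
    h = o * np.tanh(c)                         # next hidden state = output
    return h, c

h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H))
```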
  159. 159. The Core Idea 160 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  160. 160. Visual QnA
  161. 161. VQA: Dataset and Problem definition 162 VQA dataset - Examples Q: How many dogs are seen? Q: What animal is this? Q: What color is the car? Q: What is the mustache made of? Q: Is this vegetarian pizza?
  162. 162. Solving VQA 163 Approach [Malinowski et al., 2015] [Ren et al., 2015] [Andres et al., 2015] [Ma et al., 2015] [Jiang et al., 2015] Various methods have been proposed
  163. 163. DPPnet 164 Motivation Common pipeline for using deep learning in vision: take a CNN trained on ImageNet, switch the final layer, and fine-tune for the new task. Observation: in VQA, the task is determined by the question.
  164. 164. DPPnet 165 Main Idea Switch the parameters of a layer (the dynamic parameter layer) based on the question, using a question-driven parameter prediction network.
  165. 165. DPPnet 166 Parameter Explosion The dynamic parameter layer is an fc-layer with Q inputs and P outputs, so it needs N = Q x P weights; predicting them from an M-dimensional question feature (the dimension of the hidden state) takes R = Q x P x M parameters. For example, Q = 1000, P = 1000, M = 500 gives R = 500,000,000 (1.86 GB for a single layer), while the whole of VGG19 has about 144,000,000 parameters.
  166. 166. DPPnet 167 Parameter Explosion Solution: predict only N candidate weights with N < Q x P, so the prediction layer needs R = N x M parameters instead of Q x P x M. We can control N.
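The parameter counts above as a quick back-of-the-envelope check in Python; Q, P, M are taken from the slide, while the candidate-weight count N below is an arbitrary illustrative choice.

```python
Q, P, M = 1000, 1000, 500          # dynamic layer in/out dims, question-feature dim

naive = Q * P * M                  # predict all Q*P weights directly
print(naive)                       # 500000000 parameters
print(naive * 4 / 1024**3)         # ~1.86 GB at 4 bytes per float32 weight

N = 40_000                         # assumed number of candidate weights (we control N)
hashed = N * M                     # with weight sharing, only N*M parameters
print(hashed)                      # 20000000
```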
  167. 167. DPPnet 168 Weight Sharing with the Hashing Trick The weights of the dynamic parameter layer are picked from a small pool of candidate weights (the output of the question fc-layer) via hashing [Chen et al., 2015].
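A tiny numpy sketch of the idea (illustrative only; the sizes are toy values and the hash below is a stand-in, not the hash function used in the paper): the dynamic layer's Q x P weight matrix is filled by indexing a small candidate-weight vector with a hash of each position, so many entries share one free parameter.

```python
import numpy as np

Q, P, N = 6, 4, 5                          # tiny sizes for illustration
candidates = np.random.rand(N)             # candidate weights predicted from the question

def hash_index(i, j, n_buckets):
    # stand-in hash of the index pair; the real method uses a proper hash function
    return (i * 2654435761 + j * 40503) % n_buckets

W = np.empty((Q, P))
for i in range(Q):
    for j in range(P):
        W[i, j] = candidates[hash_index(i, j, N)]   # many entries share one candidate

print(W)   # a Q x P weight matrix built from only N free parameters
```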
  168. 168. DPPnet 169 Final Architecture End-to-End Fine-tuning is possible (Fully-differentiable)
  169. 169. DPPnet 170 Qualitative Results Q: What is the boy holding? DPPnet: surfboard DPPnet: bat
  170. 170. DPPnet 171 Qualitative Results Q: What animal is shown? DPPnet: giraffe DPPnet: elephant
  171. 171. DPPnet 172 Qualitative Results Q: How does the woman feel? DPPnet: happy Q: What type of hat is she wearing? DPPnet: cowboy
  172. 172. DPPnet 173 Qualitative Results Q: How many cranes are in the image? DPPnet: 2 (3) Q: How many people are on the bench? DPPnet: 2 (1)
  173. 173. How to combine image and question? 174
  174. 174. How to combine image and question? 175
  175. 175. How to combine image and question? 176
  176. 176. How to combine image and question? 177
  177. 177. How to combine image and question? 178
  178. 178. How to combine image and question? 179
  179. 179. How to combine image and question? 180
  180. 180. How to combine image and question? 181
  181. 181. Multimodal Compact Bilinear Pooling 182
  182. 182. Multimodal Compact Bilinear Pooling 183
  183. 183. Multimodal Compact Bilinear Pooling 184
  184. 184. Multimodal Compact Bilinear Pooling 185
  185. 185. MCB without Attention 186
  186. 186. MCB with Attention 187
  187. 187. Results 188
  188. 188. Results 189
  189. 189. Results 190
  190. 190. Results 191
  191. 191. Results 192
  192. 192. Results 193
  193. 193. Word2Vec
  194. 194. Word2vec? 195
  195.-207. (figure-only slides)
  208. 208. Image Captioning
  209. 209. Image Captioning? 210
  210. 210. Overall Architecture 211
  211. 211. Language Model 212
  212. 212. Language Model 213
  213. 213. Language Model 214
  214. 214. Language Model 215
  215. 215. Language Model 216
  216. 216. Training phase 217
  217. 217. Training phase 218
  218. 218. Training phase 219
  219. 219. Training phase 220
  220. 220. Training phase 221
  221. 221. Training phase 222
  222. 222. Test phase 223
  223. 223. Test phase 224
  224. 224. Test phase 225
  225. 225. Test phase 226
  226. 226. Test phase 227
  227. 227. Test phase 228
  228. 228. Test phase 229
  229. 229. Test phase 230
  230. 230. Test phase 231
  231. 231. Results 232
  232. 232. Results 233
  233. 233. But not always.. 234
  234. (figure-only slide)
  235. 235. Show, attend and tell 236
  236.-239. (figure-only slides)
  240. 240. Results 241
  241. 241. Results 242
  242. 242. Results (mistakes) 243
  243. 243. Neural Art
  244. 244. Preliminaries 245 Understanding Deep Image Representations by Inverting Them CVPR2015 Texture Synthesis Using Convolutional Neural Networks NIPS2015
  245. 245. A Neural Algorithm of Artistic Style 246
  246. 246. A Neural Algorithm of Artistic Style 247
  247. 247. 248 Texture Synthesis Using Convolutional Neural Networks -NIPS2015 Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
  248. 248. Texture? 249
  249. 249. Visual texture synthesis 250 Which one do you think is real? The right one is real. The goal of texture synthesis is to produce (arbitrarily many) new samples from an example texture.
  250. 250. Results of this work 251 The images on the right are the given source textures!
  251. 251. How? 252
  252. 252. Texture Model 253 Input $X_a$ gives feature maps $F_a^1, F_a^2, F_a^3$ and input $X_b$ gives $F_b^1, F_b^2, F_b^3$; each $F^l$ has one column per filter in layer $l$.
  253. 253. Feature Correlations 254 For each layer, compute the Gram matrix of the feature maps, e.g. $G_a^2 = (F_a^2)^T F_a^2$ for layer 2 of input $X_a$.
  254. 254. Feature Correlations 255 $F_a^2$ is a (W*H) x (number of filters) matrix, so the Gram matrix $G_a^2 = (F_a^2)^T F_a^2$ is (number of filters) x (number of filters).
  255. 255. Texture Generation 256 Compare the Gram matrices of the source texture and the generated image at every layer: $G_a^1$ vs. $G_b^1$, $G_a^2$ vs. $G_b^2$, $G_a^3$ vs. $G_b^3$.
  256. 256. Texture Generation 257 The per-layer loss is the element-wise squared difference between the two Gram matrices, and the total loss is the (weighted) sum of the layer-wise losses.
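A short numpy sketch of the Gram-matrix texture loss described above; the feature-map shapes are arbitrary and random arrays stand in for real CNN features.

```python
import numpy as np

def gram(F):
    """F: feature maps of one layer, shape (C, H, W) -> Gram matrix (C, C)."""
    C = F.shape[0]
    F = F.reshape(C, -1)           # one row per filter, W*H columns
    return F @ F.T                 # correlations between filter responses

# stand-ins for the feature maps of the source texture (a) and generated image (b)
feats_a = [np.random.rand(64, 56, 56), np.random.rand(128, 28, 28)]
feats_b = [np.random.rand(64, 56, 56), np.random.rand(128, 28, 28)]

loss = 0.0
for Fa, Fb in zip(feats_a, feats_b):
    Ga, Gb = gram(Fa), gram(Fb)
    loss += np.mean((Ga - Gb) ** 2)   # element-wise squared loss per layer
print(loss)                           # minimized w.r.t. the generated image
```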
  257. 257. Results 258
  258. 258. Results 259
  259. 259. 260 Understanding Deep Image Representations by Inverting Them -CVPR2015 Aravindh Mahendran, Andrea Vedaldi (VGGgroup)
  260. 260. Reconstruction from feature map 261
  261. 261. Reconstruction from feature map 262 Input $X_a$ gives feature maps $F_a^1, F_a^2, F_a^3$ and input $X_b$ gives $F_b^1, F_b^2, F_b^3$ (one map per filter). Let's make these features similar by changing the input image!
  262. 262. Receptive Field 263
  263. 263. 264 A Neural Algorithm of Artistic Style Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
  264. 264. How? 265 Style Image + Content Image → Mixed Image (Neural Art)
  265. 265. How? 266 Style Image + Content Image → Mixed Image (Neural Art), combining Texture Synthesis Using Convolutional Neural Networks (style) and Understanding Deep Image Representations by Inverting Them (content)
  266. 266. How? 267 Gram matrix
  267. 267. Neural Art 268 $p$: original photo, $a$: original artwork, $x$: image to be generated. Total loss = content loss + style loss: $\mathcal{L}_{total}(p, a, x) = \alpha\, \mathcal{L}_{content}(p, x) + \beta\, \mathcal{L}_{style}(a, x)$
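A small numpy sketch of combining the content loss with the Gram-based style loss; random arrays stand in for the CNN features of $p$, $a$, and $x$, and alpha/beta are arbitrary example weights, not the paper's settings.

```python
import numpy as np

def gram(F):
    C = F.shape[0]
    F = F.reshape(C, -1)
    return F @ F.T

# stand-ins for features of the photo p, the artwork a, and the generated image x
F_p = np.random.rand(128, 28, 28)              # content-layer features of p
F_x_content = np.random.rand(128, 28, 28)      # same layer, generated image x
style_a = [np.random.rand(64, 56, 56)]         # style-layer features of a
style_x = [np.random.rand(64, 56, 56)]         # same layers, generated image x

content_loss = 0.5 * np.sum((F_x_content - F_p) ** 2)
style_loss = sum(np.mean((gram(Fx) - gram(Fa)) ** 2)
                 for Fx, Fa in zip(style_x, style_a))

alpha, beta = 1.0, 1000.0                      # example content/style weights
total_loss = alpha * content_loss + beta * style_loss
print(total_loss)                              # minimized w.r.t. x by gradient descent
```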
  268. 268. Results 269
  269. 269. Results 270
  270. (figure-only slide)
