Deep Style: Using Variational Auto-encoders for Image Generation

  1. Deep Style TJ Torres Data Scientist, Stitch Fix PyData NYC 2015 Using Variational Auto-encoders for Image Generation
  2. Data Labs
  3. Data Labs
  4. Data Labs
  5. Data Labs
  6. MOTIVATION: Our goal at Stitch Fix. Pipeline: (1) Total Inventory → (2) Recommendation Algo → (3) Stylists → (4) Filtered Items → (5) Final Items Sent.
  7. COLD START PROBLEM: New Clients, New Clothing.
  8. COLD START PROBLEM: New Clients, New Clothing. 1. Get new clothing. 2. Get new clients. 3. ???????? 4. PROFIT!!!
  9. COLD START PROBLEM: New Clients, New Clothing. 1. Get new clothing. 2. Get new clients. 3. ???????? 4. PROFIT!!! Preemptive Modeling.
  10. TURN TO IMAGES • Style/fashion is primarily visual. • We wish to use images for modeling purposes. • Heuristics for how we process image data are unknown or quite complex. • We don’t want to have to develop image features. • Turn to deep learning to learn the feature extraction.
  11. OUTLINE 1. Introduction to NNs 2. Unsupervised Deep Learning 3. Getting started with Chainer 4. Training a simple model.
  12. 1. Introduction to NNs 2. Unsupervised Deep Learning 3. Getting started with Chainer 4. Training a simple model. OUTLINE
  13. 1. Introduction to NNs 2. Unsupervised Deep Learning 3. Getting started with Chainer 4. Training a simple model. 5. Open source package! 6. Conclusions/Future (current) Directions OUTLINE
  14. NEURAL NETWORKS http://www.wired.com/2013/02/three-awesome-tools-scientists-may-use-to-map-your-brain-in-the-future/
  15. http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
  16. Whoa Dude! http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
  17. http://arxiv.org/pdf/1502.04623v2.pdf
  18. INTRO TO NEURAL NETS. Begin with input: nodes 1-6 form layer 1 (input).
  19. INTRO TO NEURAL NETS. Begin with input (layer 1), then transform into layer 2: $f_i^{(l)}(x) = \tanh\left(\sum_j W_{ij}^{(l)} x_j^{(l-1)} + b^{(l)}\right)$
  20. INTRO TO NEURAL NETS. Transform the data repeatedly with a non-linear function through to layer 3 (output): $f^{(1)} \circ \cdots \circ f^{(n)}(x)$
  21. INTRO TO NEURAL NETS. Calculate the loss function and update the weights: $L(x_{\mathrm{out}}, y) = \overbrace{\tfrac{1}{m}\sum_{k=1}^{m}(x_k - y_k)^2}^{\mathrm{MSE}}$
  22. INTRO TO NEURAL NETS. Weight update: $W_{ij}^{(l)*} = W_{ij}^{(l)}\left(1 - \alpha\,\tfrac{\partial L}{\partial W_{ij}}\right)$
  23. INTRO TO NEURAL NETS. Gradients via the chain rule (backpropagation): $\frac{\partial L}{\partial W_{ij}^{(l)}} = \left(\frac{\partial L}{\partial x_{\mathrm{out}}}\right)\left(\frac{\partial x_{\mathrm{out}}}{\partial f^{(n-1)}}\right)\cdots\left(\frac{\partial f^{(l)}}{\partial W_{ij}^{(l)}}\right)$
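To make slides 18-23 concrete, here is a minimal NumPy sketch of a single tanh layer, the MSE loss, and a gradient step. This is not from the talk: the layer sizes, data, and learning rate are illustrative assumptions, and the update is written as the usual additive gradient step rather than the multiplicative form shown on slide 22.

    import numpy as np

    rng = np.random.default_rng(0)

    # One layer: f(x) = tanh(W x + b), as in slide 19.
    W = rng.normal(scale=0.1, size=(2, 6))   # 6 inputs -> 2 outputs
    b = np.zeros(2)

    x = rng.normal(size=6)                   # input (layer 1)
    y = np.array([0.5, -0.5])                # target

    # Forward pass and MSE loss (slide 21).
    h = np.tanh(W @ x + b)
    loss = np.mean((h - y) ** 2)

    # Backward pass: chain rule through the MSE and tanh (slide 23).
    dL_dh = 2 * (h - y) / h.size
    dL_dpre = dL_dh * (1 - h ** 2)           # d tanh(u)/du = 1 - tanh(u)^2
    dL_dW = np.outer(dL_dpre, x)
    dL_db = dL_dpre

    # Gradient step (slide 22); alpha is the learning rate.
    alpha = 0.1
    W -= alpha * dL_dW
    b -= alpha * dL_db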
  24. WHY DEEP LEARNING? 1) With no hidden layers, a NN resembles just a linear transformation. 2) Shallow networks approximate PCA. 3) Composing non-linear activation functions, $f^{(1)} \circ \cdots \circ f^{(n)}(x)$, adds increasing nonlinearity. 4) Deep architectures therefore learn more complex/nonlinear models.
  25. DL WITH SUPERVISION Most deep learning methods rely on supervised training data. MO: Feature Extraction w/ Deep Learning Final Classification Layer(s) http://parse.ele.tue.nl/education/cluster2
  26. ISSUES FOR STYLE. PROBLEM: No reliable system of style labels for image data.
  27. ISSUES FOR STYLE. PROBLEM: No reliable system of style labels for image data. Thankfully, we can learn feature representations without supervision. The key is to compress the data with a nonlinear encoding process.
  28. UNSUPERVISED DEEP LEARNING
  29. UNSUPERVISED DEEP LEARNING
  30. AUTO-ENCODERS Two different processes combined into one. 1) Encoding (inferential) 2) Decoding (generative)
  31. AUTO-ENCODERS. Two different processes combined into one: 1) Encoding (inferential), 2) Decoding (generative). Original Image → Encode → Compressed Data → Decode → Reconstructed Image.
  32. AUTO-ENCODERS. Training: 1) Initialize to random weights in layers.
  33. AUTO-ENCODERS. Training: 1) Initialize to random weights in layers. 2) Full forward pass of a batch through encoding and then decoding of the encoded rep.
  34. AUTO-ENCODERS. Training: 1) Initialize to random weights in layers. 2) Full forward pass of a batch through encoding and then decoding of the encoded rep. 3) Construct loss via MSE of original data to reconstructed data.
  35. AUTO-ENCODERS. Training: 1) Initialize to random weights in layers. 2) Full forward pass of a batch through encoding and then decoding of the encoded rep. 3) Construct loss via MSE of original data to reconstructed data. 4) Calculate gradients and backprop through to train new weights.
  36. AUTO-ENCODERS. Training: 1) Initialize to random weights in layers. 2) Full forward pass of a batch through encoding and then decoding of the encoded rep. 3) Construct loss via MSE of original data to reconstructed data. 4) Calculate gradients and backprop through to train new weights. 5) Iterate.
  37. AUTO-ENCODER ISSUES. 1) AEs will often overfit unless the amount of training data is large. 2) Gradients diminish quickly, so weight corrections are small “far away” from the output.
  38. AUTO-ENCODER ISSUES. 1) AEs will often overfit unless the amount of training data is large. 2) Gradients diminish quickly, so weight corrections are small “far away” from the output. SOLUTION: 1) Use a variational component to “regularize” training. 2) *Not Covered* Stack auto-encoders and train greedily (DBN).
  39. Passage MANY deep learning frameworks!!!
  40. Passage
  41. INTRO TO CHAINER. An easy-to-use framework for training neural networks. BASIC OBJECTS: Variables are wrappers on ndarrays; Functions operate on Variable objects. Operations of functions on variables are memorized in sequence, and backpropagation is done by automatic differentiation, simply moving backwards through the sequence of operations.
  42. x = np.ones(1)*5 y = np.ones(1)*3 x = chainer.Variable(x) y = chainer.Variable(y) z = x**2 + y**2 + 2*y INTRO TO CHAINER
  43. x = np.ones(1)*5 y = np.ones(1)*3 x = chainer.Variable(x) y = chainer.Variable(y) z = x**2 + y**2 + 2*y INTRO TO CHAINER
  44. x = np.ones(1)*5 y = np.ones(1)*3 x = chainer.Variable(x) y = chainer.Variable(y) z = x**2 + y**2 + 2*y In [3]: z.data Out[3]: array([ 40.]) INTRO TO CHAINER
  45. x = np.ones(1)*5 y = np.ones(1)*3 x = chainer.Variable(x) y = chainer.Variable(y) z = x**2 + y**2 + 2*y In [3]: z.data Out[3]: array([ 40.]) INTRO TO CHAINER
  46. x = np.ones(1)*5 y = np.ones(1)*3 x = chainer.Variable(x) y = chainer.Variable(y) z = x**2 + y**2 + 2*y In [3]: z.data Out[3]: array([ 40.]) #calculate gradients z.backward() INTRO TO CHAINER
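As a follow-up not shown in the deck: after z.backward() the gradients land on the input Variables (for a size-1 output like z, Chainer fills the initial gradient with ones). The values follow from z = x^2 + y^2 + 2y:

    x.grad   # array([ 10.])  since dz/dx = 2*x = 10
    y.grad   # array([ 8.])   since dz/dy = 2*y + 2 = 8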
  47. INTRO TO CHAINER. Steps to a NN: 1. Define a model using chainer.FunctionSet (it contains all parametric functions and is a simple way to wrap computational elements into one object). 2. Design and code the forward network pass. 3. Set an optimizer from chainer.optimizers. 4. Make a train script that iteratively passes batches forward through the network, calls loss.backward(), and updates the weights with optimizer.update(). A minimal skeleton of these steps is sketched below.
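A minimal sketch of those four steps, in the spirit of the model built in the following slides but not taken from them; the layer sizes, the toy auto-encoder forward pass, and the get_batches helper are assumptions for illustration.

    import chainer
    import chainer.functions as F
    from chainer import FunctionSet, optimizers

    # 1. Model: all parametric functions wrapped into one FunctionSet.
    model = FunctionSet(
        l1=F.Linear(784, 128),
        l2=F.Linear(128, 784),
    )

    # 3. Optimizer.
    optimizer = optimizers.Adam()
    optimizer.setup(model)

    # 2. Forward pass: here a toy one-hidden-layer auto-encoder returning its loss.
    def forward(x_data):
        x = chainer.Variable(x_data)
        h = F.relu(model.l1(x))
        y = model.l2(h)
        return F.mean_squared_error(y, x)

    # 4. Train script: forward, backprop, update.
    for epoch in range(10):
        for x_data in get_batches():   # hypothetical helper yielding float32 (batch, 784) arrays
            optimizer.zero_grads()
            loss = forward(x_data)
            loss.backward()
            optimizer.update()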
  48. INTRO TO CHAINER. ADVANTAGES: 1. Forward passes through networks are intuitive and easily debugged. 2. Can use arbitrary control flow statements. 3. Backpropagation is easily implemented by traversing the computational graph backwards. 4. High level of readability.
  49. BUILDING A SIMPLE AUTO-ENCODER
  50. MODEL SETUP
      #layer setup
      layers = {}
      #encoding layers
      layers['encode0'] = F.Linear(img_size, n0)
      layers['encode1'] = F.Linear(n0, 2*encoding_size)  # outputs both the mean and the log-variance
      #decoding layers
      layers['decode0'] = F.Linear(encoding_size, n0)
      layers['decode1'] = F.Linear(n0, img_size)
      #model setup
      model = chainer.FunctionSet(**layers)
      optimizer = optimizers.Adam()
      optimizer.setup(model)
  51. ENCODING # Encoder input = chainer.Variable(input) input
  52. # Encoder input = chainer.Variable(input) input = F.relu(model.encode0(input)) input ENCODING
  53. # Encoder input = chainer.Variable(input) input = F.relu(model.encode0(input)) latent = F.relu(model.encode1(input)) latent ENCODING
  54. VARIATIONAL STEP: sample from the distribution $q_\phi(z) = \mathcal{N}(z;\, \mu^{(i)}, \sigma^{2(i)} I)$. # Variational layer mean, std = F.split_axis(latent, 2, 1) noise = np.random.standard_normal(mean.data.shape)
  55. VARIATIONAL STEP: sampled. # Variational layer mean, std = F.split_axis(latent, 2, 1) noise = np.random.standard_normal(mean.data.shape) sampled = noise * F.exp(0.5 * std) + mean  # reparameterization: std here holds the log-variance, so exp(0.5*std) is the standard deviation
  56. DECODING # Decoder output = F.relu(model.decode0(sampled)) output
  57. DECODING # Decoder output = F.relu(model.decode0(sampled)) reconstruction = F.sigmoid(model.decode1(output)) reconstruction
  58. UPDATE
      # Reconstruction loss is the MSE
      loss = F.mean_squared_error(reconstruction, input)
      # “Regularize” the latent vector
      loss += F.gaussian_kl_divergence(mean, std)
      $L(x) = D_{KL}(q_\phi(z)\,\|\,\mathcal{N}(0, I)) + \mathrm{MSE}(x, y_{\mathrm{out}})$
  59. UPDATE
      # Reconstruction loss is the MSE
      loss = F.mean_squared_error(reconstruction, input)
      # “Regularize” the latent vector
      loss += F.gaussian_kl_divergence(mean, std)
      #backprop
      optimizer.zero_grads()
      loss.backward()
      optimizer.update()
  60. AFTER TRAINING
  61. RESULTS Still testing the efficacy of modeling style with the encoded space. Normally, the generative portion would be thrown out after training, but here we can use it to look at our style space.
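One way to use the generative portion to look at the style space, not shown in the deck but consistent with the decoder defined above: sample latent vectors from N(0, I) and push them through the decoding layers. encoding_size, img_height, and img_width are assumed names (with img_size = img_height * img_width * 3).

    # Hypothetical generation sketch using the decoder layers from the model above.
    z = np.random.standard_normal((1, encoding_size)).astype(np.float32)
    z = chainer.Variable(z)
    h = F.relu(model.decode0(z))
    generated = F.sigmoid(model.decode1(h))                      # pixel values in [0, 1]
    image = generated.data.reshape(img_height, img_width, 3)     # back to an RGB image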
  62. TRY IT YOURSELF https://github.com/stitchfix/fauxtograph
  63. COMMAND LINE TOOL
      $ pip install fauxtograph
      $ fauxtograph download images/
      $ fauxtograph train images/ models/model_out
      $ fauxtograph generate models/model_out generated_images/
  64. source: @genekogan
  65. FUTURE DIRECTIONS Issues with scaling to high resolution.
  66. FUTURE DIRECTIONS. Issues with scaling to high resolution. For a 100x200 RGB image: 100 x 200 x 3 = 60,000-node input layer; 60,000 x (4,000-node step-down layer) = 240M weights; 240M x 32 bits ≈ 960 MB.
  67. FUTURE DIRECTIONS. Add convolution layers: 1) Reduce the number of parameters. 2) Add translation robustness. 3) Hierarchical feature structure.
  68. FUTURE DIRECTIONS. Add convolution layers: 1) Reduce the number of parameters. 2) Add translation robustness. 3) Hierarchical feature structure. COMING SOON.
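A quick back-of-the-envelope check of the numbers above, with a comparison against a convolutional first layer; the 5x5 kernel and 32 filters are illustrative assumptions, not figures from the talk.

    # Dense first layer for a 100x200 RGB image, as on the slide.
    inputs = 100 * 200 * 3                # 60,000 input nodes
    dense_params = inputs * 4000          # 240,000,000 weights
    print(dense_params * 4 / 1e9)         # ~0.96 GB at 32 bits (4 bytes) per weight

    # Hypothetical convolutional first layer: 32 filters of size 5x5 over 3 channels.
    conv_params = 32 * (5 * 5 * 3 + 1)    # weights + biases = 2,432 parameters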
  69. CONCLUSIONS 1) A style feature space would help resolve the cold-start problem for both clients and items. 2) Auto-encoders are useful for deducing a feature space in an unsupervised way. 3) Turn to the VAE for a drag-and-drop way to prevent overfitting. 4) Convolution is on its way. You can check out the branch: convolutional-vae
  70. QUESTIONS? Original VAE Paper: http://arxiv.org/abs/1312.6114 Blog Post: http://multithreaded.stitchfix.com/blog/2015/09/17/deep-style/
  71. APPENDIX: VARIATIONAL INFERENCE. Want to solve for the posterior: $p_\theta(z|x) = \frac{p_\theta(x|z)\,p_\theta(z)}{p_\theta(x)}$. But the posterior can be intractable to calculate efficiently. Approximate $p_\theta(z|x) \approx q_\phi(z)$ by minimizing the KL divergence $D_{KL}\left(q_\phi(z)\,\|\,p_\theta(z|x)\right) = \int \mathrm{d}z\; q_\phi(z)\,\ln\left(\frac{q_\phi(z)}{p_\theta(z|x)}\right)$
  72. APPENDIX: VARIATIONAL AUTO-ENCODER. The auto-encoder learns/infers in the Bayesian sense too. Learning the encoding is equivalent to maximizing the likelihood, $\arg\max_z p_\theta(x|z)$, and generating the decoding to maximizing the posterior, $\arg\max_x p_\theta(z|x)$. Apply variational inference at the decoding step to calculate the posterior.
  73. APPENDIX: VARIATIONAL AUTO-ENCODER. The auto-encoder now models distributions for the latent space. If we guess a normal form for our “variational distribution”…
  74. APPENDIX: VARIATIONAL AUTO-ENCODER. …the KL divergence between two Gaussians has a closed form: $D_{KL}\left(q_\phi(z)\,\|\,p_\theta(z|x)\right) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$
  75. APPENDIX: VARIATIONAL AUTO-ENCODER. The $(\mu_1 - \mu_2)^2$ term acts like an L2 loss.
  76. APPENDIX: VARIATIONAL AUTO-ENCODER. Against a standard normal prior this reduces to $D_{KL} = \sum_i \left(\frac{1}{2}\left[\sigma_i^2 + \mu_i^2 - 1\right] - \log\sigma_i\right)$
  77. APPENDIX: VARIATIONAL AUTO-ENCODER. Drop this in as a loss term to regularize the latent space!
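A small numerical sanity check of the closed-form KL term above, not from the talk. It implements the summed expression directly in NumPy; note that the Chainer call used on slide 58, F.gaussian_kl_divergence(mean, ln_var), takes the log-variance rather than sigma as its second argument.

    import numpy as np

    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) from the slide above.
    def kl_to_standard_normal(mu, sigma):
        return np.sum(0.5 * (sigma**2 + mu**2 - 1.0) - np.log(sigma))

    mu = np.array([0.3, -0.2])
    sigma = np.array([0.9, 1.1])
    print(kl_to_standard_normal(mu, sigma))   # ~0.085; exactly 0 when mu = 0 and sigma = 1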