This section covers self-supervised learning of visual and audio concepts from unlabeled video: by predicting information such as the ambient sounds that co-occur with video frames, a system can discover concepts without manual labels. The goal is to build systems that, like humans, learn from multi-sensory experience.
26. Unsupervised Learning of Spoken Language with Visual Context
Unsupervised Learning of Spoken Language with Visual Context. David Harwath, Antonio Torralba, James Glass.
Advances in Neural Information Processing Systems (NIPS), 2016.
39. Vision and Audition
Visually Indicated Sounds. Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman.
Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
53. Self-supervised system
[Figure: video frames over time paired with the accompanying audio track]
• Ambient Sound Provides Supervision for Visual Learning. Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. European Conference on Computer Vision (ECCV), 2016.
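The self-supervised recipe sketched on this slide can be illustrated in a few lines: cluster per-clip audio features to get pseudo-labels, then train a visual model to predict those labels from the frames. This is a minimal sketch with assumed stand-ins; in the actual paper the audio is summarized by sound statistics and the visual predictor is a convolutional network, whereas here both modalities are synthetic feature vectors and the "visual model" is a nearest-centroid classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): per-clip audio statistics, plus visual
# features that are correlated with them via a random linear map.
n_clips, k = 300, 4
audio = rng.normal(size=(n_clips, 8))
frames = audio @ rng.normal(size=(8, 16))

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means: return a cluster assignment for each row of x."""
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        # Keep the old center if a cluster ends up empty.
        centers = np.stack([x[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

# Step 1: cluster the audio to get pseudo-labels -- no human annotation.
pseudo = kmeans(audio, k)

# Step 2: fit a visual model to the sound-derived labels
# (nearest centroid in frame-feature space, as a toy stand-in for a CNN).
vis_centers = np.stack([frames[pseudo == c].mean(0) for c in range(k)])
pred = np.argmin(((frames[:, None] - vis_centers) ** 2).sum(-1), axis=1)
acc = (pred == pseudo).mean()
print(f"frames predict sound-derived clusters with accuracy {acc:.2f}")
```

Because the visual features here are correlated with the audio by construction, the visual model recovers the audio clusters well above chance, which is the mechanism by which "concepts emerge" without labels.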
55. Concepts emerge with self-supervision
[Figure: video frames over time paired with the accompanying audio track]
• Ambient Sound Provides Supervision for Visual Learning. Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. European Conference on Computer Vision (ECCV), 2016.