This section covers self-supervised learning of visual and audio concepts from unlabeled video: by predicting information such as the ambient sounds that co-occur with video frames, a system can discover concepts without manual labels. The goal is to build systems that, like humans, learn from multi-sensory experience.
26. Unsupervised Learning of Spoken Language with Visual Context
Unsupervised Learning of Spoken Language with Visual Context. David Harwath, Antonio Torralba, James Glass.
Advances in Neural Information Processing Systems (NIPS), 2016.
39. Vision and Audition
Visually Indicated Sounds. Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman.
Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
53. Self-supervised system
[Figure: video frames over time paired with the accompanying audio track]
• Ambient Sound Provides Supervision for Visual Learning. Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. European Conference on Computer Vision (ECCV), 2016.
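The self-supervised recipe sketched on this slide can be illustrated in a few lines: cluster per-clip audio features to get pseudo-labels, then train a visual model to predict those labels from the frames. This is a minimal sketch with assumed stand-ins; in the actual paper the audio is summarized by sound statistics and the visual predictor is a convolutional network, whereas here both modalities are synthetic feature vectors and the "visual model" is a nearest-centroid classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): per-clip audio statistics, plus visual
# features that are correlated with them via a random linear map.
n_clips, k = 300, 4
audio = rng.normal(size=(n_clips, 8))
frames = audio @ rng.normal(size=(8, 16))

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means: return a cluster assignment for each row of x."""
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        # Keep the old center if a cluster ends up empty.
        centers = np.stack([x[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

# Step 1: cluster the audio to get pseudo-labels -- no human annotation.
pseudo = kmeans(audio, k)

# Step 2: fit a visual model to the sound-derived labels
# (nearest centroid in frame-feature space, as a toy stand-in for a CNN).
vis_centers = np.stack([frames[pseudo == c].mean(0) for c in range(k)])
pred = np.argmin(((frames[:, None] - vis_centers) ** 2).sum(-1), axis=1)
acc = (pred == pseudo).mean()
print(f"frames predict sound-derived clusters with accuracy {acc:.2f}")
```

Because the visual features here are correlated with the audio by construction, the visual model recovers the audio clusters well above chance, which is the mechanism by which "concepts emerge" without labels.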
55. Concepts emerge with self-supervision
[Figure: video frames over time paired with the accompanying audio track]
• Ambient Sound Provides Supervision for Visual Learning. Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. European Conference on Computer Vision (ECCV), 2016.