Matt Feiszli at AI Frontiers: Video Understanding

I will discuss the state of the art of video understanding, particularly its research and applications at Facebook. I will focus on two active areas: multimodality and time. Video is naturally multi-modal, offering great possibility for content understanding while also opening new doors like unsupervised and weakly-supervised learning at scale. Temporal representation remains a largely open problem; while we can describe a few seconds of video, there is no natural representation for a few minutes of video. I will discuss recent progress, the importance of these problems for applications, and what we hope to achieve.

Published in: Technology

  1. VIDEO UNDERSTANDING
     Matt Feiszli, Research Scientist / Manager, Facebook AI
  2. AI @FACEBOOK: Research, Tools, Platforms, Product
  3. VIDEO @FACEBOOK (see notes): Facebook AI, Mobile Vision, Video ML, Integrity, FRL (AR / VR)
  4. Make it Relevant
     • Understand: What is this about? What’s the language? Who’s in it? Where does it take place?
     • Personalize: Who wants to see this? Which part(s)?
     • Deliver: Highest possible quality. Many possible devices. Variety of bandwidths.
  5. HUMAN-LEVEL UNDERSTANDING (by watching)
  6. OCR, AUDIO, VISION, SPEECH
  7. TIME: motion & change
  8. WHERE ARE WE NOW?
     • Multimodal, temporal signal
     • Idea: Novel tasks replace labels
       ◦ Language + vision
       ◦ Audio as labels for video
     • Aspirations vs. reality?
  9. RETRIEVAL & RANKING
     • Watch / no-watch: first few minutes
       ◦ Should “understand” several minutes
     • Goal: Long-form content representation
     • Reality: Metadata is strongest signal.
       ◦ Topic tagging
       ◦ People, places, activities, brands
  10. GREAT MOMENTS
     • Video: Boredom punctuated by greatness
       ◦ Highlight reels, summaries
       ◦ Objectionable content
     • Can find some moments.
       ◦ Highly multimodal.
     • Complex actions, intents are a mystery.
  11. Visuotemporal Structure
  12. Action: Doing Pushups
  13. Correspondence
  14. Temporal Structure
     • Action labels have temporal structure
       ◦ Pushups: two key poses, two transitions
       ◦ Compare: “Baking a cake”
     • Current visual models tend to ignore this
       ◦ Instead: correlated objects, scenes, etc.
  15. Temporal Structure
     • Speech recognition
       ◦ Words -> phonemes -> features
       ◦ Modern models mostly learn this
     • Not without ambiguity, but far better than actions
     (Macquarie University, Dept. of Linguistics, “Vowel Spectra”)
  16. Towards self-supervision?
     • Goal: self-supervision (“free supervision”)
     • Examples:
       ◦ Compression (e.g. autoencoders)
       ◦ Neighboring image patches
       ◦ Temporal ordering
       ◦ Audio-visual matching
  17. Self-Supervised Learning with Audio and Video Temporal Synchronization
     Bruno Korbar, Du Tran, Lorenzo Torresani. NIPS 2018
  18. Related Work: Arandjelovic & Zisserman, ICCV’17, L^3 Net
  19. Audio-Video Temporal Synchronization (AVTS)
  20. Sound Localization
  21. What is a Label (at Scale)?
     • Goal: rich features via extremely large label spaces
     • “Extremely large label space”?
       ◦ Verbs + objects?
       ◦ Combinations of attributes?
       ◦ Natural language?
  22. Extreme Scale: Exploring the Limits of Supervised Pretraining
     Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten.
     • State of the art on ImageNet-1K: 85.4% top-1 accuracy
       ◦ Architecture: ResNeXt-101 32x48d
       ◦ Data: 3.5B images
       ◦ Labels: 17K classes
       ◦ Training: distributed across 300 GPUs
       ◦ Supervision: weakly supervised
  23. Extreme Scale: Learnings from Video (to be published)
     • Transfer learning from 100M videos?
       ◦ Already setting new state of the art on Kinetics, Epic Kitchens, etc.
     • Temporal models?
     • Labels?
       ◦ Size of label space
       ◦ Objects, actions, etc.
  24. Is a toy car a car?
  25. Thank you!
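
Slides 16 to 19 name audio-visual matching and temporal synchronization as a source of "free supervision". As a rough illustration only, the sketch below builds in-sync and out-of-sync (video, audio) pairs and scores them with a margin-based contrastive loss. The function names, the `shift` and `margin` parameters, and the choice of loss are assumptions made for this example; the slides do not say how the cited work is implemented.

```python
import numpy as np


def make_sync_pairs(num_clips, shift=1):
    """Build index pairs for a synchronization pretext task (hypothetical helper).

    Each clip contributes one positive pair (its own audio, label 1) and one
    negative pair (audio taken `shift` clips away, wrapping around, label 0).
    """
    if num_clips < 2 or shift % num_clips == 0:
        raise ValueError("shift must move the audio to a different clip")
    video_idx = np.arange(num_clips)
    shifted_audio = (video_idx + shift) % num_clips
    video = np.concatenate([video_idx, video_idx])
    audio = np.concatenate([video_idx, shifted_audio])
    labels = np.concatenate(
        [np.ones(num_clips, dtype=int), np.zeros(num_clips, dtype=int)]
    )
    return video, audio, labels


def contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss over paired embeddings.

    In-sync pairs are pulled together (squared distance); out-of-sync pairs
    are pushed apart until they are at least `margin` away from each other.
    """
    video_emb = np.asarray(video_emb, dtype=float)
    audio_emb = np.asarray(audio_emb, dtype=float)
    in_sync = np.asarray(in_sync, dtype=float)
    dist = np.linalg.norm(video_emb - audio_emb, axis=1)
    positive_term = in_sync * dist ** 2
    negative_term = (1.0 - in_sync) * np.maximum(margin - dist, 0.0) ** 2
    return float(np.mean(positive_term + negative_term))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for the outputs of a video network and an audio network.
    video_features = rng.normal(size=(4, 8))
    audio_features = rng.normal(size=(4, 8))
    v, a, y = make_sync_pairs(num_clips=4, shift=1)
    print(contrastive_loss(video_features[v], audio_features[a], y))
```

In a real setup the embeddings would come from trainable video and audio networks and the loss would be minimized by gradient descent; no labels beyond the recording's own timing are needed, which is the sense of "free supervision" on slide 16.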
