8. ★ Machine vision models often require large amounts of labeled data to
train well
★ Existing labeled datasets can be too generic, covering a concept space
that is too broad for our purposes
10. ImageNet
14 million+ images across 21K+ classes
YouTube-8M
450K+ hours of video across 4700+ classes
Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg
and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition
Challenge. IJCV, 2015.
Abu-El-Haija, Sami, et al. "YouTube-8M: A large-scale video classification
benchmark." arXiv preprint arXiv:1609.08675 (2016).
12. ★ Game graphics have become extremely realistic over the years
★ Games are scriptable, enabling complex simulations
★ Simulating in-game lets you ignore low-level tasks such as movement
animation and routing
15. ★ The Rockstar Advanced Game Engine (RAGE) renders highly realistic
graphics
★ A huge modding community provides extensive customization
★ Programmatically configurable options
19. ★ Programmatically configurable
options
○ We can generate entities of choice
in-game and have them perform
complex actions
○ Vehicles: driving, turning, waiting at
stoplights
○ People: entering/exiting vehicles,
waiting to cross the street, parking
○ Environment: weather, time of day,
camera elevation, zoom
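One way to picture these configurable options is as a grid of scenario settings that a scripting mod could iterate over. The sketch below is purely illustrative: the `Scenario` class, its field names, and the option values are assumptions made for this example and are not part of any GTA V modding API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of one simulated recording session."""
    weather: str
    time_of_day: str
    camera_elevation_m: int


# Assumed option values, chosen only for illustration.
WEATHER = ["clear", "rain", "fog"]
TIME_OF_DAY = ["dawn", "noon", "night"]
CAMERA_ELEVATION_M = [30, 60]


def all_scenarios():
    """Enumerate every combination of the environment options."""
    return [
        Scenario(weather, time_of_day, elevation)
        for weather, time_of_day, elevation in product(
            WEATHER, TIME_OF_DAY, CAMERA_ELEVATION_M
        )
    ]


scenarios = all_scenarios()
print(len(scenarios), "scenarios to record")
```

Enumerating the combinations up front makes it easy to check that every weather, time-of-day, and camera setting is represented in the recorded footage.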
20. ★ Grand Theft Auto Dataset:
○ Video footage
○ Objects of interest per frame
(vehicles and pedestrians)
○ Object locations
(bounding boxes)
○ Text descriptions
(e.g. a white truck is turning left)
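A per-frame record holding these items might look like the sketch below. The class names, field names, and the pixel-coordinate box layout are assumptions for illustration; the actual dataset format is not described on the slide.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ObjectAnnotation:
    """One object of interest in a frame (assumed layout)."""
    label: str   # e.g. "vehicle" or "pedestrian"
    x: int       # left edge of the bounding box, in pixels
    y: int       # top edge of the bounding box, in pixels
    width: int
    height: int


@dataclass
class FrameAnnotation:
    """All annotations attached to a single video frame (assumed layout)."""
    frame_index: int
    objects: List[ObjectAnnotation] = field(default_factory=list)
    description: str = ""


frame = FrameAnnotation(
    frame_index=0,
    objects=[ObjectAnnotation("vehicle", x=120, y=80, width=64, height=40)],
    description="a white truck is turning left",
)

# asdict() converts nested dataclasses into plain dicts, ready for JSON.
record = json.dumps(asdict(frame))
print(record)
```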
21. CNNs
★ Extract features from the input image and distill them down to class
predictions
★ Preserve spatial relationships between pixels
Example class labels: Bird, Airplane, Superman, Car
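To make the feature-extraction idea concrete, here is a minimal sketch of the sliding-window operation at the core of a convolutional layer, written in plain Python. The tiny image and the hand-picked kernel are assumptions for illustration; a real CNN learns its kernels during training and stacks many such layers.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of a grayscale image."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Each output value depends only on a small neighbourhood,
            # so the spatial layout of the image is preserved.
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A kernel that responds to left-to-right increases in intensity.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The non-zero column in the feature map sits exactly where the edge is in the input, which is what preserving spatial relationships means in practice.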
36. ★ YOLO9000 (YOLO v2) is a real-time object detection convolutional
neural network architecture
★ Redmon, Joseph and Farhadi, Ali. "YOLO9000:
better, faster, stronger." arXiv (2017).
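Object detectors output bounding boxes, and a common way to compare a predicted box with an annotated one is intersection over union (IoU). The sketch below is a generic illustration and is not taken from the YOLO9000 code; the `(x_min, y_min, x_max, y_max)` box layout is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes yield no intersection.
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
annotated = (1, 1, 3, 3)
print(iou(predicted, annotated))
```

An IoU of 1.0 means the boxes coincide and 0.0 means they do not overlap at all.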
42. RNNs
★ Work well with sequential input (e.g. words in a sentence or a vector of
numbers representing an image)
★ For each input, incorporate a “feedback” loop that carries forward the
information received and the decision made for the previous input in the
sequence
Diagram: Input → Neural Network → Output
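The feedback loop can be sketched as a single recurrent step that mixes the current input with the previous hidden state. This is a minimal vanilla-RNN sketch in plain Python; the weight names (`w_xh`, `w_hh`, `b_h`) and the example values are assumptions, and a trained network would learn them from data.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        # The feedback term: the previous state influences the new one.
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_xh, w_hh, b_h):
    """Feed a sequence through the RNN and return every hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


# Illustrative weights for 2 inputs and 2 hidden units.
w_xh = [[1.0, 0.0], [0.0, 1.0]]
w_hh = [[0.5, 0.0], [0.0, 0.5]]
b_h = [0.0, 0.0]

states = run_rnn([[1.0, 0.0], [1.0, 0.0]], w_xh, w_hh, b_h)
print(states)
```

Feeding the same input twice produces two different hidden states, because the second step also sees what the network remembered from the first.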
43. Vocabulary of 4 letters: h e l o
Letters could be encoded as:
h [1 0 0 0]
e [0 1 0 0]
l [0 0 1 0]
o [0 0 0 1]
Input letter → next letter:
h → e (e.g. input “h”, output “e”)
e → l
l → l
l → o
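The encoding above can be reproduced in a few lines. This sketch assumes the training word is "hello", as the letter sequence on the slide suggests; the helper names are made up for illustration.

```python
VOCAB = ["h", "e", "l", "o"]


def one_hot(letter):
    """Return a vector with a 1 at the letter's position in VOCAB."""
    vector = [0] * len(VOCAB)
    vector[VOCAB.index(letter)] = 1
    return vector


def next_letter_pairs(word):
    """Pair each letter with the letter that follows it."""
    return [(word[i], word[i + 1]) for i in range(len(word) - 1)]


pairs = next_letter_pairs("hello")
encoded = [(one_hot(current), one_hot(target)) for current, target in pairs]

for (current, target), (x, y) in zip(pairs, encoded):
    print(current, x, "->", target, y)
```

Each encoded pair is one training example: the first vector is fed to the RNN and the second is the letter it should predict next.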
49. Attention
★ Train the model to focus on salient objects in
the image
★ Instead of feeding features from the
entire image to an RNN, just feed the
salient region’s features
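One common way to realize this idea is soft attention: score each image region, turn the scores into weights with a softmax, and pass the weighted combination of region features to the RNN. The sketch below assumes the scores are already available (in a real model a learned scoring function would produce them), and the function names are illustrative.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, scores):
    """Return attention weights and the weighted sum of region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * features[d] for w, features in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Two regions with 2-dimensional features; the second is scored as salient.
region_features = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(region_features, scores=[0.0, 50.0])
print(weights, context)
```

When one region's score dominates, the context vector is essentially that region's features, which matches the "just feed the salient region" description; a hard-attention variant would simply pick the highest-scoring region.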
54. Search by Text in Video
★ Extract captions from video and store them in an index
★ Fast video search by text query over large amounts of video
Example search: “red truck”
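A simple way to index captions for text search is an inverted index that maps each word to the frames whose caption contains it. This is a minimal sketch under assumed inputs: the `(video_id, frame_index, caption)` tuples and the sample captions are invented for illustration, and a production system would more likely use a dedicated search engine.

```python
from collections import defaultdict


def build_index(captions):
    """Map each caption word to the set of (video_id, frame_index) hits."""
    index = defaultdict(set)
    for video_id, frame_index, text in captions:
        for word in text.lower().split():
            index[word].add((video_id, frame_index))
    return index


def search(index, query):
    """Return the frames whose caption contains every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


captions = [
    ("video_a", 10, "a red truck is turning left"),
    ("video_a", 42, "a white truck is waiting at a stoplight"),
    ("video_b", 7, "a red car is parking"),
]
index = build_index(captions)
hits = search(index, "red truck")
print(hits)  # {('video_a', 10)}
```

Because lookups touch only the sets for the query words, search time depends on the number of matching frames rather than on the total amount of video.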
55. Search by Example in Video
★ The user draws a bounding box around an object on a video frame
★ The system queries for similar objects of interest across the entire
video dataset, at the frame level
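Search by example is often implemented as a nearest-neighbour lookup over feature vectors. The sketch below assumes that features for the user's box and for every detected object have already been extracted (for instance by a CNN) and ranks candidates by cosine similarity; the function names and sample vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_features, candidates, top_k=3):
    """Rank (frame_id, feature_vector) candidates against the query."""
    scored = [
        (cosine_similarity(query_features, features), frame_id)
        for frame_id, features in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [frame_id for _, frame_id in scored[:top_k]]


candidates = [
    ("frame_1", [1.0, 0.0]),
    ("frame_2", [0.0, 1.0]),
    ("frame_3", [0.9, 0.1]),
]
matches = most_similar([1.0, 0.0], candidates, top_k=2)
print(matches)  # ['frame_1', 'frame_3']
```

A linear scan like this is fine for small collections; a large video dataset would typically call for an approximate nearest-neighbour index instead.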
57. ★ GTA V allows us to create fully annotated, custom-tailored,
photorealistic datasets
★ We can use such a dataset to train models that are good at object
detection/localization, captioning, and search by example or text for
overhead video
★ Models trained on GTA data are also applicable in areas such as
real-time security camera alerting and self-driving cars