Linear Recurrent Convolutional Networks for Segment-Based
Multiple Object Tracking
Erick Lin, Amirreza Shaban, Dr. Byron Boots
Robot Learning Lab, Institute for Robotics and Intelligent Machines
Introduction
Automatic object tracking in moving images has remained a long-standing problem in the
domain of computer vision, yet it is of paramount practical importance in many day-to-
day scenarios, especially those involving surveillance or human-computer interaction. The
application of eye-tracking technology, for example, has enabled insights into how humans
process visual information such as text, which have led to the development of more effective
methods of diagnostics as well as accessible digital interfaces [2].
Regarded as a problem complementary to object recognition, or the process of identifying objects in still images by the pixels that compose them, object tracking focuses on the subsequent task of matching the objects in one image to the same objects in another image, where their appearances may differ slightly in position, in lighting, or through events such as occlusion by objects closer to the foreground. Objects that can be tracked include any tangible items, people, landmarks, or even parts of other objects that humans visually perceive as separate entities [8].
Motivation
Object recognition and object tracking have both been framed in the context of machine
learning. For object recognition, learning models traditionally take the form of convolutional
neural networks (CNNs), which have been highly favored due to their lower training times
compared to similar models and their ability to take advantage of object locality, the property that pixels that make up an object share the same neighborhood. In object recognition
for a single image, convolutional neural networks are often used to output the set of all
the superpixels, or groups of pixels that are similar in location and color, in that image.
Afterward, one of a variety of robust methods such as the POISE [4] algorithm, which has
been successful at addressing the problem of recognizing objects located far from image
boundaries, is used to merge superpixels into segments, which are intended to represent
whole objects.
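
As a point of reference only, superpixels of the kind described above can be produced with a few lines of scikit-image; the sketch below uses SLIC as a stand-in (not necessarily the superpixel method in our pipeline), and the file name is a placeholder.

# Illustrative sketch of the superpixel step, using SLIC from scikit-image as a
# stand-in; segment proposals (e.g., POISE [4]) are a separate step not shown here.
from skimage.io import imread
from skimage.segmentation import slic

image = imread("frame_0001.png")  # H x W x 3 RGB frame (hypothetical path)

# Group pixels that are similar in location and color into superpixels.
sp_labels = slic(image, n_segments=400, compactness=10.0, start_label=0)

num_superpixels = int(sp_labels.max()) + 1
print(f"{num_superpixels} superpixels found")
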
Learning models for object tracking have seen especially swift progress in recent years, with breakthroughs in computational efficiency achieved through the use of linear regression and greedy matching techniques [6]. We will utilize
a variation of recurrent neural networks (RNNs), which are characterized by one or more
direct feedback loops from outputs to inputs, to build a fast learning model for tracking all
the visible objects in an image over time. While models based on a form of RNN known
as the long short-term memory (LSTM) network have performed successfully on tasks such
as annotating individual images with English-language descriptions [1], a shortcoming of
LSTM networks is that they apply nonlinear transformations to their input data; these models therefore require larger quantities of training data to avoid overfitting and are also more time-consuming to train. Thus, LSTM networks are
infeasible for the large dataset sizes typically associated with moving images; on the other
hand, by being composed exclusively of linear transformations, our RNN architecture for
multiple object tracking is intended to circumvent this issue.
Objective
We have thus far prototyped our current design of the linear recurrent convolutional neural network model using the open-source deep learning framework Caffe. For classifying superpixels, we will use a “deep,” or many-layered, convolutional neural network implementation such as AlexNet [5], which performs well on high-resolution images. We will then perform average pooling of the superpixels, that is, averaging the characteristics of the pixels in each superpixel to render it uniform and thereby more clearly distinguished from other superpixels in subsequent processing. Next, we will run a segmentation procedure such as the previously mentioned POISE algorithm, and follow this with average pooling of the segments by their superpixels.
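
To make the two average-pooling steps concrete, the following is a minimal numpy sketch under assumed interfaces: a per-pixel CNN feature map and an integer label map of superpixel or segment ids. The names and shapes are placeholders rather than our actual implementation; the same routine can pool pixels into superpixel descriptors and, after POISE, pool again to obtain one descriptor per segment.

# Hedged sketch of region-wise average pooling; `features` and `labels` are assumed
# formats (per-pixel CNN features and per-pixel region ids), not our real interfaces.
import numpy as np

def average_pool(features, labels, num_regions):
    """Average a (H, W, D) per-pixel feature map over the regions given by an
    integer (H, W) label map with ids in [0, num_regions); returns (num_regions, D)."""
    H, W, D = features.shape
    flat_feats = features.reshape(-1, D)
    flat_labels = labels.reshape(-1)
    sums = np.zeros((num_regions, D))
    np.add.at(sums, flat_labels, flat_feats)                 # accumulate features per region
    counts = np.bincount(flat_labels, minlength=num_regions)
    return sums / np.maximum(counts, 1)[:, None]             # mean feature per region

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 4, 8))        # toy 4x4 feature map with 8 channels
    sp = rng.integers(0, 3, size=(4, 4))      # toy label map with 3 superpixels
    print(average_pool(feats, sp, 3).shape)   # -> (3, 8)
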
Subsequently, a sequence of fully connected neural networks (NNs), a more classical architecture that serves a wider variety of purposes in machine learning, will learn and then perform the nonlinear mappings of the segmentation data, making it possible for our object tracking computations to retain their linearity. This step is justifiable in spite of the relative expense of NNs and the aforementioned shortcomings of nonlinear models, because the segmentation data is much smaller in size than the original image data. Our entire architecture is
summarized below for convenience.
Input → CNN → Pooling + POISE → Nonlinear NN → Tracker → Output
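
Read as data flow, the stages compose as in the skeleton below; every function name is a placeholder standing in for the corresponding stage, not an existing interface (the actual prototype is built in Caffe).

# Skeleton of the pipeline's data flow; each callable is a placeholder for a stage.
def track_sequence(frames, cnn, pool_superpixels, poise_segments, pool_segments,
                   nonlinear_nn, tracker):
    outputs = []
    for frame in frames:
        feats, sp_labels = cnn(frame)                     # CNN: per-pixel features + superpixels
        sp_feats = pool_superpixels(feats, sp_labels)     # average pooling over superpixels
        seg_labels = poise_segments(sp_feats, sp_labels)  # POISE: merge superpixels into segments
        seg_feats = pool_segments(sp_feats, seg_labels)   # average pooling over segments
        X_t = nonlinear_nn(seg_feats)                     # fully connected NN: nonlinear mapping
        outputs.append(tracker.step(X_t))                 # Tracker: linear recurrent matching
    return outputs
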
The centerpiece of our architecture, the newly proposed linear recurrent neural network, is
referred to as the Tracker layer. Our prototype of the Tracker layer so far is governed by
the following equations, which are also visualized in a network diagram.
[Network diagram of the Tracker layer relating equations (1)-(6) omitted.]

$H_t = H_{t-1} + X_t^{\top} M_t X_t$    (1)
$C_t = C_{t-1} + \tilde{V}_t^{\top} M_t X_t$    (2)
$\tilde{V}_t = \phi_1(V_t)$    (3)
$V_t = W_{t-1} X_t$    (4)
$W_{t-1} = C_{t-1} (H_{t-1} + \lambda s_{t-1} I)^{-1}$    (5)
$M_t = \delta(\phi_2(V_t))$    (6)
$s_t = s_{t-1} + \sigma(M_t)$    (7)
In these equations, $t$ is the index of the current frame and can be seen as a time parameter, $X_t$ is the primary input matrix whose rows represent segments, $\lambda$ is a regularization constant commonly used in machine learning to prevent overfitting, and $H_t$ and $C_t$ are hidden and memory cell units, respectively, which accumulate information from previously seen examples. $\phi_1$ is the operation that keeps only the maximum value in each row of a matrix while zeroing out the rest, $\delta$ converts an $n$-dimensional vector into an $n \times n$ diagonal matrix, and $\sigma$ sums all the elements of a matrix. The primary output is $\tilde{V}_t$, which encodes the best matchings between the existing segments and the segments in the current frame.
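
For illustration, the recurrence can be written out as the following numpy sketch. The shapes, the transpose orientation in equation (4), the initialization, and the placeholder used for $\phi_2$ are all assumptions made only so that the dimensions line up; this is not the finalized Tracker design.

# Hedged numpy sketch of the Tracker recurrence in equations (1)-(7).
import numpy as np

def phi1(V):
    """Eq. (3): keep only the maximum entry in each row of V, zeroing out the rest."""
    out = np.zeros_like(V)
    rows = np.arange(V.shape[0])
    cols = V.argmax(axis=1)
    out[rows, cols] = V[rows, cols]
    return out

def phi2_placeholder(V):
    """Stand-in for the still-undetermined phi_2: one nonnegative weight per row of V."""
    return np.abs(V).max(axis=1)

class TrackerLayer:
    def __init__(self, d, m, lam=1e-2):
        # d: segment feature dimension; m: assumed width of the matching code.
        # A warm-start initialization of C (e.g., from the first frame) is omitted.
        self.H = np.zeros((d, d))   # hidden accumulator H_t
        self.C = np.zeros((m, d))   # memory cell accumulator C_t
        self.s = 1.0                # running weight sum s_t (started > 0 so the ridge term is nonzero)
        self.lam = lam              # regularization constant lambda

    def step(self, X):
        """Process one frame. X is (n_t, d) with one row per segment; returns V~_t."""
        d = self.H.shape[0]
        W = self.C @ np.linalg.inv(self.H + self.lam * self.s * np.eye(d))  # eq. (5)
        V = X @ W.T                          # eq. (4), orientation assumed so rows of V follow rows of X
        V_tilde = phi1(V)                    # eq. (3)
        M = np.diag(phi2_placeholder(V))     # eq. (6)
        self.H = self.H + X.T @ M @ X        # eq. (1)
        self.C = self.C + V_tilde.T @ M @ X  # eq. (2)
        self.s = self.s + float(M.sum())     # eq. (7)
        return V_tilde
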
As of now, $\phi_2$ and its parametrization remain unknown, and equation (5) still involves a matrix inverse operation, which is known to be computationally expensive. Thus, one of my primary objectives will be to work out the remaining details and modify the design of the Tracker layer in order to further improve its efficiency and accuracy; in the case of the matrix inverse, I will need to consider faster approximation methods that are feasible given our knowledge of the structure of the $X_t$ matrix.
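
As one possibility to evaluate rather than a decision, the product $C_{t-1}(H_{t-1} + \lambda s_{t-1} I)^{-1}$ can be obtained from a linear solve instead of an explicit inverse, which is cheaper and numerically better behaved; low-rank, Woodbury-style updates are another candidate, since equation (1) only adds $X_t^{\top} M_t X_t$ per frame.

# Hedged alternative to the explicit inverse in eq. (5): a linear solve.
import numpy as np

def compute_W(C, H, lam, s):
    """Return C @ inv(H + lam*s*I) without forming the inverse explicitly."""
    d = H.shape[0]
    A = H + lam * s * np.eye(d)
    # W A = C  <=>  A^T W^T = C^T, so solve for W^T and transpose back.
    return np.linalg.solve(A.T, C.T).T
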
In addition, the earlier parts of the pipeline currently require supervised training for a performance boost. This involves using a set of input image sequences paired with ground truth labels, that is, segmentations of each image that are known to be correct; I will obtain this data by applying the POISE segmentation proposal algorithm to the publicly available Sintel dataset, which contains a collection of video sequences originating from the open-source computer-animated film of the same name. The Sintel dataset also includes the ground truth optical flow for each image, which describes pixel-wise movement from the current image to the next. To match ground truth segments from each frame to the next, I will need to write algorithms that combine the segmentation and optical flow data for a frame to produce the predicted superpixels and segments for the next frame, with occlusion handling, and then match the predictions with the ground truth superpixels and segments of the actual subsequent frame by their overlap, that is, the size of their pairwise intersection divided by the size of their pairwise union.
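
A hedged sketch of this matching step is given below. The mask and flow formats are assumptions (boolean H x W masks, flow as an (H, W, 2) array of pixel displacements with the horizontal component first), the matching is greedy, and occlusion handling is omitted.

# Hedged sketch: warp a segment mask from frame t to frame t+1 with the ground
# truth optical flow, then score it against candidate masks by overlap (IoU).
import numpy as np

def warp_mask(mask, flow):
    """Push each foreground pixel of `mask` along the flow into the next frame."""
    H, W = mask.shape
    ys, xs = np.nonzero(mask)
    new_xs = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    new_ys = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    warped = np.zeros_like(mask)
    warped[new_ys, new_xs] = True
    return warped

def iou(a, b):
    """Overlap score: size of intersection divided by size of union."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def match_segments(prev_masks, flow, next_masks, threshold=0.5):
    """Greedily assign each warped previous segment to its best-overlapping next segment."""
    matches = {}
    for i, m in enumerate(prev_masks):
        warped = warp_mask(m, flow)
        scores = [iou(warped, n) for n in next_masks]
        j = int(np.argmax(scores)) if scores else -1
        if j >= 0 and scores[j] >= threshold:
            matches[i] = j
    return matches
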
We may consider adding an additional phase following the Tracker layer which further
improves segmentation results by using known refinement techniques such as composite
statistical inference (CSI) [6]. Finally, we will compare the performance of our linear recurrent convolutional network, on a variety of metrics, against currently established video segmentation benchmarks [3].
Conclusion
In this proposal, I have described a linear RNN-based model that may outperform state-of-the-art approaches to object tracking and would mark the first appearance of this class of models for this specific application. Our ideal end goal is a multiple object tracking system that works in real time on incoming video streams up to a certain resolution, a tool that would prove beneficial in many critical as well as everyday settings.
References
[1] Donahue, J., Hendricks, L. A., Guadarrama, S., and Rohrbach, M. Long-term recurrent convolutional networks for visual recognition and description. In Computer Vision and Pattern Recognition (2015).

[2] Duchowski, A. T. A breadth-first survey of eye tracking applications. Behavior Research Methods, Instruments, and Computers (2002).

[3] Galasso, F., Nagaraja, N. S., Cárdenas, T. J., Brox, T., and Schiele, B. A unified video segmentation benchmark: Annotation, metrics, and analysis. In Computer Vision and Pattern Recognition (2013).

[4] Humayun, A., Li, F., and Rehg, J. M. The middle child problem: Revisiting parametric min-cut and seeds for object proposals. In International Conference on Computer Vision (2015).

[5] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (2012).

[6] Li, F., Kim, T., Humayun, A., Tsai, D., and Rehg, J. M. Video segmentation by tracking many figure-ground segments. In International Conference on Computer Vision (2013).

[7] Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (2015).

[8] Luo, W., Xing, J., Zhang, X., Zhao, X., and Kim, T.-K. Multiple object tracking: A literature review. ACM Computing Surveys (2015).