This document proposes a linear recurrent convolutional neural network model for segment-based multiple object tracking in video. The model takes images as input, uses a CNN to classify superpixels, merges the superpixels into segments, and applies nonlinear NNs followed by a linear recurrent tracker layer to match segments over time. The objectives are to improve the tracker layer's efficiency by replacing its matrix inverse with a faster approximation and to determine the remaining parameters of the model. For evaluation, the model will be trained on a dataset with ground-truth segmentation and optical flow and compared against state-of-the-art methods.
Linear Recurrent Convolutional Networks for Segment-Based Multiple Object Tracking
Erick Lin, Amirreza Shaban, Dr. Byron Boots
Robot Learning Lab, Institute for Robotics and Intelligent Machines
Introduction
Automatic object tracking in moving images has remained a long-standing problem in the domain of computer vision, yet it is of paramount practical importance in many day-to-day scenarios, especially those involving surveillance or human-computer interaction. The application of eye-tracking technology, for example, has enabled insights into how humans process visual information such as text, which have led to the development of more effective diagnostic methods as well as accessible digital interfaces [2].
Object tracking is regarded as complementary to object recognition, the process of identifying objects in still images by the pixels that compose them. Tracking focuses on the subsequent task of matching the objects in one image to the same objects in another image, whose appearances may differ slightly in position or lighting, or may change through events such as occlusion by objects closer to the foreground. Trackable objects include tangible items, people, landmarks, or even parts of other objects that humans visually perceive as separate entities [8].
Motivation
Object recognition and object tracking have both been framed as machine learning problems. For object recognition, learning models traditionally take the form of convolutional neural networks (CNNs), which are highly favored for their lower training times compared to similar models and for their ability to exploit object locality, the property that the pixels making up an object share the same neighborhood. In object recognition for a single image, CNNs are often used to output the set of all superpixels, or groups of pixels that are similar in location and color, in that image. Afterward, a robust method such as the POISE algorithm [4], which has been successful at addressing the problem of recognizing objects located far from image boundaries, is used to merge superpixels into segments, which are intended to represent whole objects.
Learning models for object tracking have seen especially swift progress in recent years, with breakthroughs in computational efficiency achieved through linear regression and greedy matching techniques [6]. We will utilize a variation of recurrent neural networks (RNNs), which are characterized by one or more direct feedback loops from outputs to inputs, to build a fast learning model for tracking all the visible objects in an image over time. While models based on a form of RNN known as the long short-term memory (LSTM) network have performed successfully on tasks such as annotating individual images with English-language descriptions [1], a shortcoming of LSTM networks is that they are composed of nonlinear transformations of the input data; such models require larger quantities of training data to avoid the statistical problem of overfitting and are hence more time-consuming to train. This makes LSTM networks infeasible for the large dataset sizes typically associated with moving images; by contrast, our RNN architecture for multiple object tracking is composed exclusively of linear transformations and is intended to circumvent this issue.
Objective
We have thus far prototyped our current design of the linear recurrent convolutional neural network model using the open-source deep learning framework Caffe. For classifying superpixels, we will use a "deep," or many-layered, convolutional neural network implementation such as AlexNet [5], which performs well on high-resolution images. We will then apply the image processing technique of average pooling to the superpixels, that is, averaging the characteristics of the pixels in each superpixel to render it uniform and thereby better contrasted with other superpixels for subsequent processing. Next, we will run a segmentation procedure such as the previously mentioned POISE algorithm, followed by average pooling of the segments over their superpixels.
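To make the pooling step concrete, here is a minimal sketch of average pooling over superpixels. The function name `average_pool` and the array layout are illustrative assumptions, not part of our Caffe prototype; the same routine applies unchanged when pooling segments over their superpixels.

```python
import numpy as np

def average_pool(features, labels):
    """Average-pool per-pixel features over superpixels (illustrative sketch).

    features: (H, W, D) array of per-pixel descriptors (e.g. color or CNN features).
    labels:   (H, W) integer array assigning each pixel to a superpixel.
    Returns an (S, D) matrix whose s-th row is the mean feature of superpixel s.
    """
    flat_feat = features.reshape(-1, features.shape[-1])
    flat_lab = labels.ravel()
    n_sp = flat_lab.max() + 1
    sums = np.zeros((n_sp, flat_feat.shape[1]))
    np.add.at(sums, flat_lab, flat_feat)            # accumulate features per superpixel
    counts = np.bincount(flat_lab, minlength=n_sp)  # number of pixels per superpixel
    return sums / counts[:, None]
```

Replacing every pixel's feature with its superpixel's mean is what renders each superpixel uniform before segmentation.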
Next, a sequence of fully connected neural networks (NNs), a more classic architecture serving a wider variety of purposes in machine learning, will learn and then perform the nonlinear mappings of the segmentation data, making it possible for our object tracking computations to retain their linearity. This step is justifiable in spite of the relative expense of NNs and the aforementioned shortcomings of nonlinear models because the segmentation data is much smaller than the original image data. Our entire architecture is summarized below for convenience.
Input → CNN → Pooling + POISE → Nonlinear NN → Tracker → Output
The centerpiece of our architecture, the newly proposed linear recurrent neural network, is
referred to as the Tracker layer. Our prototype of the Tracker layer so far is governed by
the following equations, which are also visualized in a network diagram.
[Network diagram: dataflow of the Tracker layer among Eqs. (1)-(6).]
H_t = H_{t-1} + X_t^T M_t X_t                  (1)
C_t = C_{t-1} + Ṽ_t M_t X_t                    (2)
Ṽ_t = φ_1(V_t)                                 (3)
V_t = W_{t-1} X_t^T                            (4)
W_{t-1} = C_{t-1} (H_{t-1} + λ s_{t-1} I)^{-1} (5)
M_t = δ(φ_2(V_t))                              (6)
s_t = s_{t-1} + σ(M_t)                         (7)
In these equations, t is the index of the current frame and can be seen as a time parameter; X_t is the primary input matrix, whose rows represent segments; λ is a regularization constant, commonly used in machine learning to prevent overfitting; and H_t and C_t are hidden and memory cell units, respectively, which accumulate information from previously seen examples. φ_1 is the operation that keeps only the maximum value in each row of a matrix while zeroing out the rest, δ converts an n-dimensional vector into an n-by-n diagonal matrix, and σ sums all the elements of a matrix. The primary output is Ṽ_t, which encodes the best matchings between the existing segments and the segments in the current frame.
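As a sanity check on the recurrence, one Tracker step can be sketched in NumPy under the conventions above (rows of X_t are current-frame segments). Since φ_2 is still unspecified, `phi2` below is a placeholder that weights every current segment equally; all names here are illustrative, and the explicit `inv` call is kept only to mirror Eq. (5) as written.

```python
import numpy as np

def phi1(V):
    """Keep only the maximum entry in each row of V, zeroing the rest (phi_1)."""
    out = np.zeros_like(V)
    rows = np.arange(V.shape[0])
    cols = V.argmax(axis=1)
    out[rows, cols] = V[rows, cols]
    return out

def phi2(V):
    """Placeholder for the still-unspecified phi_2: a uniform weight of 1
    per current-frame segment (one entry per column of V)."""
    return np.ones(V.shape[1])

def tracker_step(H, C, s, X, lam=0.1):
    """One step of the Tracker recurrence, Eqs. (1)-(7) (illustrative sketch).

    X: (n, d) matrix whose rows are current-frame segment features.
    H: (d, d) hidden unit, C: (k, d) memory cell, s: scalar accumulator.
    Returns updated (H, C, s) and the matching matrix V_tilde.
    """
    d = H.shape[0]
    W = C @ np.linalg.inv(H + lam * s * np.eye(d))  # Eq. (5)
    V = W @ X.T                                     # Eq. (4): (k, n) scores
    V_tilde = phi1(V)                               # Eq. (3): best match per row
    M = np.diag(phi2(V))                            # Eq. (6): (n, n) diagonal
    H = H + X.T @ M @ X                             # Eq. (1)
    C = C + V_tilde @ M @ X                         # Eq. (2)
    s = s + M.sum()                                 # Eq. (7): sigma sums M
    return H, C, s, V_tilde
```

Each row of the returned V_tilde has at most one nonzero entry, encoding the best current-frame match for the corresponding existing segment.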
As of now, φ_2 and its parametrization remain unspecified, and equation (5) still involves a matrix inverse operation, which is computationally expensive. Thus, one of my primary objectives will be to work out the remaining details and modify the design of the Tracker layer to further improve its efficiency and accuracy; in the case of the matrix inverse, I will need to consider faster approximation methods that are feasible given our knowledge of the structure of the X_t matrix.
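One candidate direction, sketched here as an assumption rather than a settled design choice: when the diagonal entries of M_t are nonnegative, H_t + λ s_t I is symmetric positive definite, so W can be obtained by solving a linear system rather than forming the inverse explicitly, which is cheaper and numerically better behaved.

```python
import numpy as np

def weights_via_solve(C, H, lam, s):
    """Compute W = C (H + lam*s*I)^{-1} without forming the inverse.

    Because A = H + lam*s*I is symmetric, W A = C is equivalent to the
    linear system A W^T = C^T, which np.linalg.solve handles directly.
    """
    A = H + lam * s * np.eye(H.shape[0])
    return np.linalg.solve(A, C.T).T
```

A further step in this direction would be low-rank updates of the factorization across frames, since each frame only adds X_t^T M_t X_t to H; whether that pays off depends on the structure of X_t.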
In addition, the earlier parts of the pipeline currently require training in a supervised manner for a performance boost. This involves using a set of input image sequences paired with ground truth labels, the segmentations of each image that are known to be correct; I will obtain this data by applying the POISE segmentation proposal algorithm to the publicly available Sintel dataset, which contains a collection of video sequences originating from the open-source computer-animated film of the same name. The Sintel dataset also includes the ground truth optical flow for each image, which describes pixel-wise movement from the current image to the next. In order to match ground truth segments from each frame to the next, I will need to write algorithms that combine the segmentation and optical flow data for any frame to produce the predicted superpixels and segments for the next frame, with occlusion handling, and then match the predictions with the ground truth superpixels and segments of the actual subsequent frame by their overlap, that is, the size of their pairwise intersection divided by the size of their pairwise union.
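The overlap-based matching step can be sketched as follows. This toy version assumes the flow-warped prediction is already available as a label map and omits the occlusion handling described above; the function name and the greedy best-match rule are illustrative choices.

```python
import numpy as np

def match_by_overlap(pred, gt):
    """Match predicted segments to ground-truth segments by overlap (IoU).

    pred, gt: (H, W) integer label maps, one label per segment.
    Returns a dict mapping each predicted label to the ground-truth label
    with the highest intersection-over-union, or None if no overlap exists.
    """
    matches = {}
    for p in np.unique(pred):
        p_mask = pred == p
        best_label, best_iou = None, 0.0
        for g in np.unique(gt):
            g_mask = gt == g
            inter = np.logical_and(p_mask, g_mask).sum()
            union = np.logical_or(p_mask, g_mask).sum()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_label, best_iou = g, iou
        matches[p] = best_label
    return matches
```

For full frames, accumulating the pairwise intersection counts in a single confusion matrix would avoid the quadratic loop over label pairs.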
We may consider adding a phase following the Tracker layer that further improves segmentation results using known refinement techniques such as composite statistical inference (CSI) [6]. Finally, we will compare the performance of our linear recurrent convolutional network on various metrics against the currently established video segmentation benchmarks [3].
Conclusion
In this proposal, I have described a linear RNN-based model that may outperform state-of-the-art approaches to object tracking and would mark the first appearance of such a class of models for this specific application. Our ideal end goal is a multiple object tracking system that works in real time on incoming video streams up to a certain resolution, a tool that would prove beneficial in many critical as well as everyday settings.
References
[1] Donahue, J., Hendricks, L. A., Guadarrama, S., and Rohrbach, M. Long-term recurrent convolutional networks for visual recognition and description. In Computer Vision and Pattern Recognition (2015).
[2] Duchowski, A. T. A breadth-first survey of eye tracking applications. Behavior Research Methods, Instruments, and Computers (2002).
[3] Galasso, F., Nagaraja, N. S., Cárdenas, T. J., Brox, T., and Schiele, B. A unified video segmentation benchmark: Annotation, metrics, and analysis. In Computer Vision and Pattern Recognition (2013).
[4] Humayun, A., Li, F., and Rehg, J. M. The middle child problem: Revisiting parametric min-cut and seeds for object proposals. In International Conference on Computer Vision (2015).
[5] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (2012).
[6] Li, F., Kim, T., Humayun, A., Tsai, D., and Rehg, J. M. Video segmentation by tracking many figure-ground segments. In International Conference on Computer Vision (2013).
[7] Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (2015).
[8] Luo, W., Xing, J., Zhang, X., Zhao, X., and Kim, T.-K. Multiple object tracking: A literature review. ACM Computing Surveys (2015).