1. Lecture 2 Overview on object recognition and one practical example 6.870 Object Recognition and Scene Understanding http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
3. Object recognition Is it really so hard? Find the chair in this image. This is a chair (the template). Output of normalized correlation.
4. Object recognition Is it really so hard? Find the chair in this image. The output is pretty much garbage: simple template matching is not going to make it. My biggest concern while making this slide was: how do I justify 50 years of research, and this course, if this experiment had worked?
5. Object recognition Is it really so hard? Find the chair in this image. "A popular method is that of template matching, by point to point correlation of a model pattern with the image pattern. These techniques are inadequate for three-dimensional scene analysis for many reasons, such as occlusion, changes in viewing angle, and articulation of parts." Nevatia & Binford, 1977.
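Template matching by normalized correlation, the method the quote dismisses, is easy to sketch. The toy implementation below (pure Python, illustrative only; function names are my own) scans a template over a grayscale image and returns the best-scoring location; it works when an exact copy of the template is present, and fails under the occlusions, viewpoint changes, and articulations the quote lists:

```python
import math

def ncc(patch, template):
    """Normalized correlation between two equal-sized grayscale patches."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mp, mt = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mp) * (b - mt) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mp) ** 2 for a in p) *
                    sum((b - mt) ** 2 for b in t))
    return num / den if den else 0.0

def match(image, template):
    """Slide the template over the image; return the best top-left corner."""
    th, tw = len(template), len(template[0])
    best, best_pos = -2.0, None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            s = ncc(patch, template)
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos, best
```

An exact copy of the template scores 1.0 at its true location; a slightly rotated or occluded chair typically scores no better than background clutter.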
6. So, let’s make the problem simpler: Block world Nice framework to develop fancy math, but too far from reality… Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
7. Binford and generalized cylinders Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
9. Recognition by components Irving Biederman Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.
14. Stages of processing. "Parsing is performed, primarily at concave regions, simultaneously with a detection of nonaccidental properties."
15. Nonaccidental properties. Certain properties of edges in a two-dimensional image are taken by the visual system as strong evidence that the edges in the three-dimensional world contain those same properties. Nonaccidental properties (Witkin & Tenenbaum, 1983) are rarely produced by accidental alignments of viewpoint and object features, and consequently are generally unaffected by slight variations in viewpoint. See, e.g., Freeman, "the generic viewpoint assumption", 1994.
17. The high speed and accuracy of determining a given nonaccidental relation (e.g., whether some pattern is symmetrical) should be contrasted with performance in making absolute quantitative judgments of variations in a single physical attribute, such as length of a segment or degree of tilt or curvature. Object recognition is performed by humans in around 100 ms.
18. "If contours are deleted at a vertex they can be restored, as long as there is no accidental filling-in. The greater disruption from vertex deletion is expected on the basis of their importance as diagnostic image features for the components." Recoverable vs. unrecoverable.
19. From generalized cylinders to GEONS. "From variation over only two or three levels in the nonaccidental relations of four attributes of generalized cylinders, a set of 36 GEONS can be generated." Geons represent a restricted form of generalized cylinders.
31. Discriminative methods. Object detection and recognition is formulated as a classification problem: where are the screens? The image is partitioned into a set of overlapping windows, and a decision is taken at each window about whether it contains a target object or not. Each window (a bag of image patches) is mapped, in some feature space, to one side of a decision boundary separating "computer screen" from "background".
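The sliding-window formulation can be sketched as follows. The window size, stride, and the `classify` callback are hypothetical placeholders for an actual trained classifier, not part of the class toolbox:

```python
def sliding_windows(img_h, img_w, win=64, step=16):
    """Enumerate top-left corners of overlapping windows covering the image."""
    return [(r, c)
            for r in range(0, img_h - win + 1, step)
            for c in range(0, img_w - win + 1, step)]

def detect(image_shape, classify, win=64, step=16):
    """Run a binary classifier at every window; keep windows scored positive."""
    h, w = image_shape
    return [(r, c) for r, c in sliding_windows(h, w, win, step)
            if classify(r, c) > 0]
```

The step size trades accuracy for speed: a smaller stride means more overlapping windows and more classifier evaluations per image.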
40. Toy example. Weak learners from the family of lines. Each data point has a class label y_t = ±1 and a weight w_t = 1. A line h with p(error) = 0.5 is at chance.
41. Toy example. Each data point has a class label y_t = ±1 and a weight w_t = 1. This one seems to be the best. This is a "weak classifier": it performs slightly better than chance.
42. Toy example. Each data point has a class label y_t = ±1 and a weight w_t. We update the weights, w_t ← w_t exp(−y_t H_t), and set a new problem for which the previous weak classifier performs at chance again. (Slides 42–45 repeat this update over successive boosting rounds.)
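The reweighting step on these slides, w_t ← w_t exp(−y_t H_t), can be sketched directly (a minimal illustration; the function name is my own):

```python
import math

def update_weights(w, y, h):
    """Boosting-style reweighting: w_t <- w_t * exp(-y_t * h(x_t)).
    Misclassified points (y_t and h disagree in sign) gain weight,
    correctly classified points lose it."""
    new_w = [wi * math.exp(-yi * hi) for wi, yi, hi in zip(w, y, h)]
    # normalize so the weights sum to 1 (optional, keeps scales comparable)
    s = sum(new_w)
    return [wi / s for wi in new_w]
```

After the update, the previous weak classifier performs at chance on the reweighted data, which is exactly what forces the next round to find something new.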
46. Toy example. The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4.
50. Exponential loss. [Plot: loss vs. margin yF(x), comparing misclassification error, squared error, and exponential loss.]
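The three losses plotted on this slide can be written as functions of the margin yF(x); note that for labels y ∈ {−1, +1}, the squared error (y − F(x))² equals (1 − yF(x))². A minimal sketch:

```python
import math

def misclassification(margin):
    """0/1 error: 1 if the point is on the wrong side, else 0."""
    return 1.0 if margin <= 0 else 0.0

def squared_error(margin):
    """(y - F(x))^2 = (1 - yF(x))^2 for y in {-1, +1}."""
    return (1.0 - margin) ** 2

def exponential_loss(margin):
    """exp(-yF(x)): the smooth upper bound on 0/1 error that boosting minimizes."""
    return math.exp(-margin)
```

Squared error penalizes confidently correct points (margin > 1) as much as errors, while the exponential loss keeps decreasing as the margin grows, which is one reason it is the natural choice here.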
51. Boosting. Sequential procedure: at each step we add a weak classifier (with its parameters) to minimize the residual loss between the desired output and the current prediction on the input. For more details: Friedman, Hastie, Tibshirani. "Additive Logistic Regression: a Statistical View of Boosting" (1998).
54. gentleBoosting.m
function classifier = gentleBoost(x, y, Nrounds)
w = ones(size(y));                           % initialize weights w = 1
for m = 1:Nrounds
    fm = selectBestWeakClassifier(x, y, w);  % solve weighted least-squares
    w = w .* exp(-y .* fm);                  % re-weight training samples
    % store parameters of fm in classifier
    …
end
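A minimal Python sketch of the same gentle-boosting loop, using regression stumps as the weak classifiers (the function names and toy data below are illustrative; this is not the class toolbox):

```python
import math

def fit_stump(x, y, w):
    """Weighted least-squares regression stump f(x) = va if x[k] > th else vb."""
    n, d = len(x), len(x[0])
    best = None
    for k in range(d):
        for th in sorted({xi[k] for xi in x}):
            above = [i for i in range(n) if x[i][k] > th]
            below = [i for i in range(n) if x[i][k] <= th]
            def wavg(idx):
                sw = sum(w[i] for i in idx)
                return (sum(w[i] * y[i] for i in idx) / sw) if sw else 0.0
            va, vb = wavg(above), wavg(below)
            err = sum(w[i] * (y[i] - (va if x[i][k] > th else vb)) ** 2
                      for i in range(n))
            if best is None or err < best[0]:
                best = (err, k, th, va, vb)
    _, k, th, va, vb = best
    return lambda xi: va if xi[k] > th else vb

def gentle_boost(x, y, nrounds=10):
    """Additive model F(x) = sum_m f_m(x) fit by gentle boosting."""
    n = len(x)
    w = [1.0] * n                                   # initialize weights
    stumps = []
    for _ in range(nrounds):
        fm = fit_stump(x, y, w)                     # solve weighted least-squares
        w = [wi * math.exp(-yi * fm(xi))            # re-weight training samples
             for wi, yi, xi in zip(w, y, x)]
        stumps.append(fm)                           # store fm in the classifier
    return lambda xi: sum(f(xi) for f in stumps)
```

The final decision is the sign of F(x); each round fits its stump to the residual structure that the reweighted data exposes.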
55. Demo gentleBoosting. Demo using gentle boost and stumps with hand-selected 2D data: > demoGentleBoost.m
59. Object recognition Is it really so hard? Find the chair in this image. But what if we use smaller patches? Just a part of the chair?
60. Parts. Find a chair in this image. But what if we use smaller patches? Just a part of the chair? Seems to fire on legs… not so bad.
69. Weak detectors First we collect a set of part templates from a set of training objects. Vidal-Naquet, Ullman (2003) …
70. Weak detectors. We now define a family of "weak detectors": the correlation of a part template with the image, displaced toward the object center. Better than chance.
71. Weak detectors. We can do a better job using filtered images: correlate the part template with a filtered version of the image instead of the raw intensities. Still a weak detector, but better than before.
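The voting step of such a weak detector can be sketched as follows (an illustrative pure-Python version; computing the correlation map itself is assumed done elsewhere, and the function name is my own):

```python
def weak_detector(response, offset, threshold):
    """Turn a part-template correlation map into votes for the object center.

    response[r][c] is the template-matching score at (r, c); offset is the
    displacement from the part to the object center. The shifted, thresholded
    map fires wherever the detected part predicts a center."""
    h, w = len(response), len(response[0])
    dr, dc = offset
    votes = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and response[r][c] > threshold:
                votes[rr][cc] = 1
    return votes
```

In practice the shifted map is also blurred before thresholding, so small errors in the part's position still vote near the true center.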
72. Training First we evaluate all the N features on all the training images. Then, we sample the feature outputs on the object center and at random locations in the background:
73. Representation and object model. Selected features for the screen detector, shown after 1, 2, 3, 4, 10, …, 100 features: a "lousy painter" rendition of the object.
82. Example: screen detection. Feature output → thresholded output → strong classifier; adding features improves the final classification. Strong classifier at iteration 200.
83. Maximal suppression. Detect local maxima of the response: we are only allowed to detect each object once, and the rest are considered false alarms. This post-processing stage can have a very strong impact on the final performance.
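The local-maximum test can be sketched directly (a minimal illustration with a 3×3 neighbourhood; the function name is my own):

```python
def local_maxima(score, threshold):
    """Keep only responses that beat the threshold and all 8 neighbours,
    so each object is detected once; everything else is suppressed."""
    h, w = len(score), len(score[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            s = score[r][c]
            if s <= threshold:
                continue
            neighbours = [score[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(s > v for v in neighbours):
                peaks.append((r, c))
    return peaks
```

Real detectors usually suppress over a larger radius (comparable to the object size), so two peaks on the same screen cannot both survive.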
85. Demo. Demo of screen and car detectors using parts, gentle boost, and stumps: > runDetector.m
Editor's Notes
The first question that we might have, especially after seeing the other parts of this course, is why use a discriminative approach? This slide shows one of the arguments sometimes used to illustrate the advantage that a discriminative framework might have over a generative framework. It is important to note that most of the time, generative models are also lousy painters: they do not try to have a full generative model down to the pixel level. In fact, those cases, as the ones reviewed before, in which the generative model is applied only to a set of sparse features are really a mixture of a discriminative method (for the low-level vision part) and a generative method (for the high-level vision part). Even if there is no discriminative training, the choice of representation is guided by our experience as vision researchers, exploring alternatives until one works: just as a discriminative method would do.
Object detection and recognition is formulated as a classification problem. The image is partitioned into a set of overlapping windows, and a decision is taken at each window about whether it contains a target object or not. Each window is represented by extracting a large number of features that encode information such as boundaries, textures, color, and spatial structure. The classification function, which maps an image window into a binary decision, is learnt using methods such as SVMs, boosting or neural networks.
In many cases, F(x) will give a real value and the classification is performed with a threshold: if F(x) > 0 then y = +1.
In the next part I will focus on a particular technique, and I will provide a detailed description of things you need to know to make it work. This part is accompanied by a Matlab toolbox that you can find on the web site of the class. It is entirely self-contained and written in Matlab. It has an implementation of gentle boosting and an object detector. The demo shows results on detecting screens and cars and can be easily trained to detect other objects.
One of the first papers to use boosting for a vision task. Take one filter, filter the image, take the absolute value and downsample. Take another filter, take the absolute value of the output and downsample. Do this once more. Then, for each color channel independently, sum the values of all the pixels; the resulting number is one of the features. If you do this for all triples of filters, you'll end up with 46,875 features.
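The three-stage filter/rectify/downsample pipeline described in this note can be sketched for one triple of filters (a toy pure-Python version under my own naming; the actual filter bank is not specified here):

```python
def convolve_valid(img, kernel):
    """2-D valid correlation of a grayscale image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[i][j] * img[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

def rectify_downsample(img):
    """Absolute value, then 2x downsampling in each dimension."""
    return [[abs(v) for v in row[::2]] for row in img[::2]]

def triple_feature(channel, f1, f2, f3):
    """One feature: filter, rectify, downsample three times, then sum pixels."""
    x = channel
    for k in (f1, f2, f3):
        x = rectify_downsample(convolve_valid(x, k))
    return sum(sum(row) for row in x)
```

Running this for every ordered triple drawn from the filter bank, on each color channel independently, yields the large feature set the note describes.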
No need for an image pyramid in order to achieve scale invariance. Very efficient. Works well for faces, which are objects well defined by the intensity pattern. Combined with a cascade, it achieves real-time performance for face detection even on large images.
This will be better than a full object template because it allows for some flexibility in the location of the parts. It is still easy to compute.
The simplest weak detector uses template matching. In this case we use template matching to detect the presence of a part of the object; once the presence is detected, it votes for the object in the center of the object coordinate frame. In this example, we are trying to detect a computer screen, using the top-left corner of the screen. The first thing we do is compute a normalized correlation between the image and the template. This gives a peak in the response near the top-left corner of the screen and, of course, many other matches that do not correspond to screens. Next we displace the output of the correlation, taking into account the relative position between the top-left corner of the screen and the center of the screen. We also blur the output. Now, if we apply a threshold, we see that the center of the screen has a value above the threshold while most of the background gets suppressed. There are still many other locations above the threshold, so this is not a good detector. But it is better than chance!
So, now we have to train. We need positive and negative examples, so we do the following: get positive samples by taking the values of all the features at the center of the target, then take negative samples from the background.
In a probabilistic framework, as opposed to the generative model, here we fit a conditional probability. Boosting is a way of fitting a logistic regression.
Finally, the object model is very similar to the one built with the generative model. Here, invariance to translation and scale is achieved by the search strategy: the classifier is evaluated at all locations (by translating the image) and at all scales (by rescaling the image in small steps).
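The scale part of that search strategy can be sketched as choosing the set of rescaling factors (a minimal illustration; the geometric step of 1.2 is a typical but assumed value):

```python
def pyramid_scales(min_size, img_size, step=1.2):
    """Scales at which to shrink the image so a fixed-size classifier window
    effectively searches all object sizes from the window size up."""
    scales, s = [], 1.0
    while img_size / s >= min_size:
        scales.append(s)
        s *= step
    return scales
```

At each scale the image is resized by 1/s and the fixed-size window is slid over it, so one trained classifier covers objects of every size.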