Article presented by the University of Berlin during the HOIP'10 conference organized by TECNALIA's Information Systems and Interaction Unit.

More information at http://www.tecnalia.com/es/ict-european-software-institute/index.htm


Counting People in Crowded Environments: An Overview

Michael Pätzold (1) / Rubén Heras Evangelio (1) / Thomas Sikora (1)
(1) Communication Systems Group, Technische Universität Berlin, Germany

ABSTRACT

Counting the number of persons in a crowded scene is of great interest in many applications. Most of the approaches proposed in the literature tackle the task of counting people in an indirect, statistical way. Recently, we presented a direct, counting-by-detection method based on fusing shape information obtained from an adapted Histogram of Oriented Gradients (HOG) algorithm with temporal information. The use of temporal information reduces false positives by considering the motion characteristics of different human body parts. A subsequent tracking and coherent motion detection of the human hypotheses further enhances the performance of this system. The performance obtained by this system is comparable to state-of-the-art systems, while not only counting people but also providing valuable information for a tracking approach. In this paper we present an overview of relevant state-of-the-art methods for counting people in crowded environments, paying special attention to the method proposed by our group and showing results based on standard video sequences.

1. INTRODUCTION

The estimation of the number of people within a scene is an essential component of higher-level video analysis layers. The availability of a system that measures the density of a crowd is indispensable for security applications such as the prevention of overcrowding in public places. Furthermore, information gathered by such systems can serve as a data basis for economic applications such as optimizing public transport schedules, reducing waiting time in supermarkets, or assessing the effectiveness of advertising. Since static video cameras are available at most public places, there is great interest in solutions for counting people in video.
In most cases the cameras are already installed, so algorithms should handle a wide range of perspectives and varying lighting conditions.

A classic people tracking system extracts foreground pixels of an image by subtracting the current image from a learned statistical background model and aggregates the resulting foreground pixels into objects by means of connected components [16]. The number of objects can then be derived directly by counting them. Such systems are restricted to areas with slowly changing lighting conditions, and the count of objects can only be determined if the objects do not interact with or occlude each other.

Due to these restrictions, algorithms specific to people counting have been developed, which can be divided into two groups. Under the assumption that it is impossible to isolate every entity of a crowd, the first group of algorithms extracts low-level information (e.g. foreground pixels or moving points) and uses it in a further step to estimate the density of the crowd. The second group of algorithms searches for the objects of interest (e.g. persons) based on a model and directly counts the number of found entities. Object models can be built out of different types of information such as motion or shape. Furthermore, the confidence of an object detection can be increased by imposing a temporal consistency constraint. This constraint requires the association of detected objects in consecutive frames of a video sequence. Data association can be a challenging task in crowded environments due to multiple detections and inter-object occlusion.

In Section 2 we elaborate our grouping in detail and subdivide relevant state-of-the-art approaches according to their type of analysis. In the subsequent section we describe our approach. Experimental results are shown in Section 4, and Section 5 concludes the paper.

2. PEOPLE COUNTING METHODS

2.1. LOW LEVEL CROWD ANALYSIS

Assuming that a crowd is dense enough that individuals cannot be separated, low-level based methods infer the number of people in the scene by some kind of mapping from data acquired by low-level computer vision techniques to an estimate of the crowd density.

Hou and Pang [10] extract foreground pixels of an image by subtracting the current image from a learned statistical background model and map the foreground pixels to the number of people in the scene using a neural network previously trained for this particular scene. Paragios and Ramesh [12] propose a method that also extracts foreground pixels, but they handle the influence of perspective by explicitly weighting the pixels according to geometric information. Foreground areas computed by subtraction from a statistical background model can be distorted by uncontrolled lighting conditions in open environments. Albiol et al. [2] address this problem by counting only moving corner points and assume that the number of these points is linearly related to the number of people in the scene. Conte et al. [6] claim that the relation between detected points and the number of people in a scene is more complex than a direct linear mapping. Therefore, they propose the use of an epsilon support vector regressor for this task. Furthermore, they achieve a high stability of the tracked points by applying a scale-invariant point descriptor. For all the algorithms of this group it should be noted that the mapping is scene-dependent and thus has to be relearned for every particular scene. This implies the existence of representative training data for every camera setup.

2.2. FOREGROUND SEGMENTATION MODEL BASED ANALYSIS

Besides the methods based on low-level information, there exists a small number of approaches that tackle the significant challenge of finding the configuration of people within a crowd (including their number, position and also articulation) by only using foreground masks of a video sequence [20, 9]. In [20] Zhao and Nevatia develop a person model based on human shapes observed under different perspectives, which defines the set of pixels occupied by a person given its position and articulation. A given configuration of people is evaluated by comparing their occupied area with the area of the foreground mask. An appropriate solution for a given foreground mask is found by sampling the resulting high-dimensional space of configurations by means of Markov-chain Monte-Carlo methods.
The efficiency of this method depends heavily on the number of samples to process. This number is decreased by incorporating additional information into the proposal probability of the Markov chain (e.g. proposing head positions based on a shape model).

2.3. MOTION MODEL BASED CROWD ANALYSIS

The distinctive motion of individual humans is used by several algorithms for crowd counting [15, 4, 14]. By modeling an entity as a region with coherent motion, it is possible to distinguish entities by analyzing the flow of characteristic points.

Rossi and Bozzoli [15] build trajectories by tracking characteristic points in areas with detected temporal changes and estimate the number of people in an image by agglomeratively clustering these points according to their motion. Since this method does not handle occlusion cases, they only applied it to ceiling-mounted cameras with vertical viewing direction. Brostow and Cipolla publish in [4] a method that tracks characteristic features with the help of optical flow and uses an unsupervised, data-driven Bayesian clustering algorithm. This method achieves good counting results on overhead camera setups with nearly vertical viewing direction. Applying this method to camera setups with a lower tilt angle might give results of worse quality, since from this viewing direction human limbs, whose motion is non-coherent, are more visible. Therefore, it is difficult to achieve an accurate clustering. Furthermore, a lower tilt angle leads to inter-object occlusions which complicate the analysis of motion.

2.4. SHAPE MODEL BASED ANALYSIS

It is not always possible to separate people by incorporating motion information alone, for instance if they are walking in unison. Analysis of the shape and appearance of an object can also be used to count objects of interest (e.g. humans) and distinguish them from objects belonging to other classes (e.g. bicycles or cars). In the case of humans it is reasonable to prefer models based on shape to models containing color information. In [7] Dalal and Triggs published a method which evaluates gradient information of still images by means of a machine learning algorithm. This method is able to detect humans in still images under a wide range of perspectives with good reliability, but it does not provide any means for handling partial occlusions, which are a common issue in the detection of persons in crowded environments. Wu and Nevatia [18] tackled this problem by designing human part detectors and combining the detected parts of a human in a joint likelihood model.

2.5. MULTI TARGET TRACKING IN CROWDED SCENES

Temporal association of found targets from one image to the next can provide information about the track of a person and, furthermore, the gained temporal consistency improves the confidence in the presence of a person. In very sparse crowds or in camera setups with a favorable viewing direction, the correspondences between the outputs of human detectors can be assigned without ambiguity. If these conditions are not met, it is possible that objects get occluded by static scene items or occlude each other temporarily. In this case, straightforward association of detections by basic techniques is hampered, and sophisticated data association methods and appearance models are required. While our algorithm uses a basic association technique between two time steps, the tracking and thus the counting performance can be increased by using one of the following methods.
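Before turning to those methods, the basic two-frame association technique mentioned above can be illustrated with a short sketch: greedily match propagated detections to current detections by minimal distance. The helper name, the gating distance `max_dist` and the toy coordinates are assumptions for illustration, not details from the paper.

```python
import math

def greedy_associate(propagated, current, max_dist=30.0):
    """Greedily associate propagated detections with current ones by
    minimal Euclidean distance. Returns (i, j) index pairs; current
    detections left unmatched would spawn new trajectories."""
    # All candidate pairs, cheapest first.
    pairs = sorted(
        ((math.dist(p, c), i, j)
         for i, p in enumerate(propagated)
         for j, c in enumerate(current)),
        key=lambda t: t[0],
    )
    used_p, used_c, matches = set(), set(), []
    for d, i, j in pairs:
        if d > max_dist:          # remaining pairs are even farther
            break
        if i in used_p or j in used_c:
            continue              # each detection matched at most once
        matches.append((i, j))
        used_p.add(i)
        used_c.add(j)
    return matches

matches = greedy_associate([(10, 10), (50, 50)],
                           [(12, 11), (49, 52), (200, 200)])
```

In the example, the third current detection stays unmatched and would start a new trajectory, mirroring the spawning rule described later in Section 3.4.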
Breitenstein et al. propose in [3] a method that accounts for the uncertainty of association by applying a multi-modal particle filter for every target. Furthermore, the association is improved by learning an appearance model for every target by online boosting. This system is able to keep track of multiple persons even under full occlusion. However, initializing a tracker requires the target to have been previously detected over multiple frames with high confidence. This initialization problem can be prevented by global-data-association based tracking, which analyzes the full video sequence at once and hence is able to associate the detections globally. As the search space of global data association methods is combinatorial, a naive search is inappropriate because of its exponential computational complexity. Zhang et al. [19] propose an approach to find an optimal solution in an efficient manner by formulating the problem as a cost-flow network and integrating an explicit occlusion model. To associate smaller track fragments into final trajectories, a suitable affinity measure is required. While these measures and their parameters are chosen manually in most cases, Li et al. develop in [11] a learning-based algorithm that tracks people in crowded scenes and achieves efficiency by hierarchically assembling track fragments over multiple stages and automatically choosing appropriate affinity measures for each stage.

3. OUTLINE OF OUR APPROACH

Recently, we published a model-based algorithm for counting persons in crowded scenarios [13]. This method is described in the following sections. Figure 1 depicts how the various modules and their interaction provide a count of the individuals observed by a single stationary camera. We trained a detector to find the upper body region of a human (1). Since the human head and torso contain only marginal shape information, a detector based on this cue alone with an acceptable detection rate would create a vast number of false positives. To avoid these false positives we combine the shape model with a uniform motion model (3), generated using optical flow information between consecutive frames (2), which leads to a combined probability map of uniform motion and characteristic shape. By seeking the modes of this probability map (4), discrete detections are obtained. These detections are associated into trajectories using motion information (5). In parallel, false positives are rejected by enforcing temporal consistency of detections. Finally, we apply an algorithm which checks trajectories for coherent motion, indicating that they belong to the same human body; thus, by keeping only one of the matches, the false detection rate is reduced again (6). By counting the number of finally resulting detections, the number of people in the scene per frame is obtained.

Figure 1: Overview of system modules and their interaction

3.1. SHAPE MODEL

Due to the high intra-class variation of humans with respect to color and texture, most recent detectors use gradient information as feature descriptors to find human-like shapes in images. Dalal and Triggs in [7] develop the Histogram of Oriented Gradients method, which is robust to the variable appearance of humans. This robustness is achieved by collecting the gradients within small regions (cells) of an image and representing them as histograms of their orientation. After normalization and concatenation of the histograms of adjacent regions, they obtain a descriptor which is classified by a support vector machine. Dalal and Triggs show results with good detection rates for humans in different articulations and under different perspectives. But, since the detector is learned on pictures containing the complete human body area, it has difficulties detecting partially occluded persons, which are common in dense crowd scenarios.

Figure 2: Upper body HOG-detector: (a): An upper body image sample of the training database. (b): The main active blocks of our trained HOG-detector are located at the head outline. (c): A sample HOG descriptor weighted by the positive SVM weights.

In this approach, we overcome the mentioned occlusion shortcomings by learning only the head-shoulder region of a human. To this end, the cell dimension is changed and a training database of the head-shoulder region of humans is established. The detector is applied to all areas of the image, and thus a probability map can be built with the help of the gained confidence values. Using a linear SVM kernel for classifying the samples, the probability map is written as

    p_shape(x) = max(w · h(x) + b, 0),

where h(x) is the HOG descriptor at location x, w is the normal of the trained decision hyperplane and b the trained bias. Figure 2(b) depicts the main active blocks of the descriptor, and Figure 2(c) reveals that mainly the horizontal gradients on the head decide whether a sample contains a head or not. Figure 3(b) shows a computed probability map for a sample input frame. Compared to the HOG detector proposed by Dalal and Triggs, this detector has to make a decision based on less information and thus is not able to classify every test sample reliably. It detects all structures with a so-called omega shape, mainly heads, but also feet and background structures. Hence, the shape model has to be combined with other information cues to reduce the false positive rate.

3.2. UNIFORM MOTION MODEL

Motion information is the second cue of the framework to detect individual humans in video sequences. This kind of motion information is obtained from a dense optical flow field and is used to identify potential image areas containing a head. By combining this information with the confidence gained from the shape model described in Section 3.1, the detection rate is increased. Figure 3(c) shows the optical flow for two consecutive frames.
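Under a linear SVM kernel, the shape score of Section 3.1 reduces to a clipped dot product. A minimal sketch follows; the descriptor, weights and bias are toy values, not the trained detector:

```python
def shape_probability(descriptor, w, b):
    """Linear-SVM shape score clamped at zero:
    p_shape = max(w . h + b, 0).
    descriptor (h) and w are equal-length float lists; w is the
    trained hyperplane normal, b the trained bias."""
    score = sum(wi * hi for wi, hi in zip(w, descriptor)) + b
    return max(score, 0.0)

# Toy example: a 2-dimensional "descriptor" against toy SVM weights.
p = shape_probability([1.0, 2.0], [0.5, 0.25], -0.5)  # 0.5*1 + 0.25*2 - 0.5
```

Negative SVM scores are clipped to zero, so non-head windows contribute nothing to the probability map.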
The direction of motion is color-coded, using different colors for the direction of flow and different brightness values for its magnitude. The significant green spots are mainly related to the upper bodies of people, while the small regions with varying hue values correspond to the limbs of the bodies. This observation leads us to assume that a human upper body (torso and head) moves uniformly, while we can observe non-uniform motion in limb regions. Furthermore, the non-uniformity of flow between people can reveal the borders between them in some cases, as shown in the picture by slightly different green values. By measuring the uniformity of motion inside an image region, we can reason about the likelihood of that region containing a human body.

A hypothesized human body region centered at x is defined by a binary mask M(x) with a head-shoulder shape. By sliding this mask over the original image, we define for every pixel an area where we hypothesize a human upper body. Since the dense optical flow field provides a motion vector v(p) for every pixel location p, we can compute the mean motion vector inside the mask located at x as

    v̄(x) = (1/N) Σ_{p ∈ M(x)} v(p),

where N is the number of pixels of the mask region. Now we can measure the probability that the region surrounding a pixel contains uniform motion by computing the average endpoint error of every particular vector with respect to the mean vector:

    e(x) = (1/N) Σ_{p ∈ M(x)} ‖v(p) − v̄(x)‖.

By sliding the binary mask over the whole image, we build the probability map seen in Figure 3(d). As shown, limb regions of humans correspond to areas with a low probability of being a head candidate.
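The mean motion vector and average endpoint error inside a mask can be sketched in a few lines. This illustration abstracts the mask geometry away and simply takes the list of flow vectors that fall inside it:

```python
import math

def motion_uniformity(flow_vectors):
    """Mean motion vector and average endpoint error over the flow
    vectors inside a hypothesized head-shoulder mask. A low error
    means uniform motion, hence a plausible head/torso region."""
    n = len(flow_vectors)
    mean = (sum(v[0] for v in flow_vectors) / n,
            sum(v[1] for v in flow_vectors) / n)
    # Average Euclidean deviation of each vector from the mean.
    err = sum(math.hypot(v[0] - mean[0], v[1] - mean[1])
              for v in flow_vectors) / n
    return mean, err

# A torso-like patch (all vectors equal) versus a limb-like patch.
_, uniform_err = motion_uniformity([(1.0, 0.0), (1.0, 0.0)])
_, limb_err = motion_uniformity([(1.0, 0.0), (-1.0, 0.0)])
```

A motion probability map would then assign high values where the endpoint error is low; the exact mapping from error to probability is left open here, as in the text.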
Figure 3: Probability maps: A scene with high crowd density shows the concept of information fusion to find head candidates: (a): Input image. (b): Due to using only gradient information of one frame, there are many regions with a high probability of being heads. (c): Dense optical flow field computed between two consecutive frames. (d): Using motion information creates another probability map. (e): Fusing both sources of information, the modes of the resulting map can be distinguished easily and represent the head detections.

3.3. INFORMATION FUSION AND MODE SEEKING

In the previous sections, shape and motion information were used to compute probability maps. Considering each cue individually, it is not possible to detect heads dependably, but by fusing the knowledge of both domains we are able to detect most heads reliably. We simply merge the probability maps by a weighted linear combination

    p(x) = p_shape(x) + λ · p_motion(x),

where λ is set empirically. The final step is to detect heads from the probability map p. We search for detections as maxima in the probability map by mean-shift mode estimation as described in [5]. The mean-shift procedure provides local smoothing of the detections and clusters overlapping detections. Each cluster corresponds to a particular head detection with a confidence which is gained by accumulating the probabilities within the mean-shift kernel support. A trade-off between miss rate and false positive detections is made by choosing an adequate threshold for the detection confidence empirically.

3.4. VALIDATION BY TRACKING

So far only information from two subsequent frames is used. As described in Section 2.5, incorporating tracking information helps to enhance the reliability of an object detector by enforcing coherent detections over several frames.
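The fusion and mode-seeking steps of Section 3.3 can be sketched as follows. For brevity this illustration replaces mean-shift mode estimation with a strict local-maximum search over the fused map, and the weight `lam` and threshold are arbitrary toy values, not the empirically tuned ones:

```python
def fuse_and_detect(p_shape, p_motion, lam=0.5, thresh=0.5):
    """Fuse shape and motion probability maps (lists of rows) by a
    weighted linear combination p = p_shape + lam * p_motion, then
    keep strict local maxima above a confidence threshold as head
    detections. A simplified stand-in for mean-shift mode seeking."""
    h, w = len(p_shape), len(p_shape[0])
    fused = [[p_shape[y][x] + lam * p_motion[y][x] for x in range(w)]
             for y in range(h)]
    detections = []
    for y in range(h):
        for x in range(w):
            v = fused[y][x]
            if v < thresh:
                continue
            neigh = [fused[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dy or dx) and 0 <= y + dy < h and 0 <= x + dx < w]
            if all(v > nv for nv in neigh):   # strict local maximum
                detections.append((x, y))
    return detections

# Both 3x3 toy maps agree on a peak in the center.
peak = [[0.1, 0.1, 0.1], [0.1, 0.6, 0.1], [0.1, 0.1, 0.1]]
heads = fuse_and_detect(peak, peak)
```

Unlike mean shift, this sketch performs no smoothing and cannot merge near-duplicate maxima, which is exactly what the kernel-based procedure in [5] adds.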
First we propagate detections into the subsequent frame by displacing them by their corresponding mean motion vector, and afterwards we associate the propagated detections with the current ones according to their minimal distance in a greedy manner. If no propagated detection is associated with a current one, a new trajectory is spawned. Every trajectory owns a confidence measure which is calculated from the detection confidences and the distances of the associated detections. The measure increases with multiple successful associations and decreases in the case of missing detections. By applying one threshold on this confidence value for labeling a track as a valid person track, and another threshold for deleting an unreliable track, we enhance the robustness of the system.

3.5. LINKING DETECTIONS USING COHERENT MOTION DETECTION

By applying the detector as described so far, the system detects regions with characteristic shape and uniform motion. Sometimes the detector gets distracted by objects like carried luggage or clothes with head-like texture, as shown in Figure 4. In these cases multiple trajectories on one pedestrian are established. By following these trajectories over time, we assume that they lie on a rigid body if they remain at a constant distance for every time step. This assumption is based on the observation that different persons change their distance to each other even if they walk in groups, owing to their individual walking cycles. Following an idea published in [4], we approximate that coherent motion is more likely if the variance of the distance between two trajectories is small. If we find multiple trajectories with coherent motion characteristics, we keep only the one with maximum height in the image, since we expect the head of a person at this position.
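The coherent-motion test on trajectory pairs can be sketched as a variance check on their mutual distance over time; the variance threshold here is a toy value, not the tuned one:

```python
import math

def coherent(traj_a, traj_b, var_thresh=1.0):
    """Two trajectories (lists of (x, y) positions per frame) are
    assumed to lie on the same rigid body if the variance of their
    mutual distance over time is small."""
    dists = [math.dist(a, b) for a, b in zip(traj_a, traj_b)]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return var < var_thresh

# Two detections riding on the same person keep a constant distance;
# trajectories of different walkers drift apart.
same_body = coherent([(0, 0), (1, 0), (2, 0)], [(0, 5), (1, 5), (2, 5)])
```

When several trajectories pass this test, the system would keep only the topmost one in the image, as described above.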
Figure 4: Linking detections by coherent motion detection: (a): False positive detections are generated, caused by characteristic shape and motion. (b): Due to their coherent motion, trajectories on the same object are grouped and only one detection is kept. (Linked trajectories are marked with red lines.)

4. EXPERIMENTAL RESULTS

The system was evaluated using the sequence S1-L1-Time-13-57, view 1, from the PETS2009 database. The ground-truth person count was generated by annotating every person in every frame, even the occluded ones. For the training of the HOG detector we cropped 158 heads from the INRIA database [1]. The dense optical flow field was computed by the publicly available implementations of Marzat et al. [11] and Werlberger et al. [17]. The number of people was estimated in a buffered way: if the system labels a trajectory as a valid person, we increase the person count back in time to the frame where the detection was first seen.

Figure 6 shows the counting results. Regardless of the optical flow method used, the system counts people accurately in crowds of different densities. The experiments showed that both optical flow methods have advantages in different modules of the system. The flow obtained by [11] is more suited for detecting uniform motion, while the implementation of [17] allows more accurate tracking results due to smoothing of the flow field by a diffusion process.

Figure 6: The estimated person count of the system generated by using different optical flow methods.

We submitted the results of the overall system to the counting task of the PETS2009 workshop. This workshop evaluates the algorithms of all participants on the same dataset and thus provides the possibility to compare different approaches. The results indicate that the performance of our system is comparable to state-of-the-art methods [8].

5. CONCLUSION

In this paper we gave an overview of relevant state-of-the-art methods for counting people in crowded environments and described a counting-by-detection method developed by our group, which is based on a model that considers the characteristic shape and motion of a person, enhanced by means of tracking information. The resulting trajectories of potential detections are analyzed with regard to coherent motion, and thus the number of false positives is decreased. In future work we will apply a more sophisticated data association method in order to tackle the challenge of tracking individuals in crowds.
6. REFERENCES

[1] INRIA dataset. Available online at http://lear.inrialpes.fr/data.
[2] Antonio Albiol, Maria J. Silla, Alberto Albiol, and Jose Manuel Mossi. Video analysis using corners motion analysis. In Proc. International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2009), pages 31–38, Miami, FL, USA, June 2009.
[3] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In Proc. IEEE 12th Int. Conf. Computer Vision, pages 1515–1522, 2009.
[4] G. J. Brostow and R. Cipolla. Unsupervised Bayesian detection of independent motion in crowds. In Proc. IEEE Computer Society Conference on Computer Vision & Pattern Recognition (CVPR2006), pages I: 594–601, 2006.
[5] Dorin Comaniciu and Peter Meer. Distribution free decomposition of multivariate data. Pattern Analysis and Applications, 2:22–30, 1998.
[6] Donatello Conte, Pasquale Foggia, Gennaro Percannella, Francesco Tufano, and Mario Vento. A method for counting moving people in video surveillance videos. EURASIP Journal on Advances in Signal Processing, 10 pages, 2010.
[7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proc. International Conference on Computer Vision & Pattern Recognition (CVPR2005), volume 2, pages 886–893, June 2005.
[8] A. Ellis and J. Ferryman. PETS2010 and PETS2009 evaluation of results using individual ground truthed single views. In Advanced Video and Signal Based Surveillance, IEEE Conference on, pages 135–142, 2010.
[9] Weina Ge and Robert T. Collins. Marked point processes for crowd counting. In Proc. IEEE Computer Society Conference on Computer Vision & Pattern Recognition (CVPR2009), pages 2913–2920, 2009.
[10] Ya-Li Hou and G. K. H. Pang. Automated people counting at a mass site. In Proc. IEEE Int. Conf. Automation and Logistics (ICAL 2008), pages 464–469, 2008.
[11] J. Marzat, Y. Dumortier, and A. Ducrot. Real-time dense and accurate parallel optical flow using CUDA. In Proceedings of the 17th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), pages 105–111, 2009.
[12] N. Paragios and V. Ramesh. A MRF-based approach for real-time subway monitoring. In Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR 2001), volume 1, 2001.
[13] Michael Pätzold, Rubén Heras Evangelio, and Thomas Sikora. Counting people in crowded environments by fusion of shape and motion information. In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, PETS 2010 Workshop, pages 157–164, Boston, USA, August 2010.
[14] Vincent Rabaud and Serge Belongie. Counting crowded moving objects. In Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR 2006), volume 1, pages 705–711, 2006.
[15] M. Rossi and A. Bozzoli. Tracking and counting moving people. In Proceedings of the IEEE International Conference on Image Processing (ICIP 1994), volume 3, pages 212–216, 1994.
[16] Chris Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR1999), pages 246–252, 1999.
[17] Manuel Werlberger, Werner Trobin, Thomas Pock, Andreas Wedel, Daniel Cremers, and Horst Bischof. Anisotropic Huber-L1 optical flow. In British Machine Vision Conference, 2009.
[18] Bo Wu and Ram Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. Int. J. Comput. Vision, 75(2):247–266, 2007.
[19] Li Zhang, Yuan Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8, 2008.
[20] Tao Zhao and Ram Nevatia. Bayesian human segmentation in crowded situations. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2003), volume 2, page 459, 2003.