CASSANDRA: audio-video sensor fusion for aggression detection∗

W. Zajdel∗, J.D. Krijnders†, T. Andringa†, D.M. Gavrila∗

∗ Intelligent Systems Laboratory, Faculty of Science, University of Amsterdam
† Auditory Cognition Group, Artificial Intelligence, Rijksuniversiteit Groningen

Abstract

This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of the complementary nature of audio and video sensing to disambiguate scene activity in real-life, noisy and dynamic environments. At the lower level, independent analysis of the audio and video streams yields intermediate descriptors of a scene such as "scream", "passing train" or "articulation energy". At the higher level, a Dynamic Bayesian Network is used as a fusion mechanism that produces an aggregate aggression indication for the current scene. Our prototype system is validated on a set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.
                                                                         2.1 Audio Unit
1 Introduction

Surveillance technology is increasingly fielded to help safeguard public spaces such as train stations, shopping malls and street corners, in view of mounting concerns about public safety. Traditional surveillance systems require human operators who monitor a wall of CCTV screens for specific events that occur rarely. Advanced systems have the potential to automatically filter out spurious information and present the operator only the security-relevant data. Existing systems still have limited capabilities; they typically perform video-based intrusion detection, and possibly some trajectory analysis, in fairly static environments.

In the context of human activity recognition in dynamic environments, we focus on the relatively unexplored problem of aggression detection. Earlier work involved solely the video domain and considered fairly controlled indoor environments with a static background and few (two) persons [2]. Because events associated with the build-up or enactment of aggression are difficult to detect by a single sensor modality (e.g. shouting versus hitting someone), in this work we combine audio and video sensing. At the low level, raw sensor data are processed to compute "intermediate-level" events or features that summarize activities in the scene. Examples of such descriptors implemented in the current system are "scream" (audio) or "train passing" and "articulation energy" (video). At the top level, a Dynamic Bayesian Network combines the visual and auditory events and incorporates any context-specific knowledge in order to produce an aggregate aggression indication. This is unlike previous work where audio-video fusion dealt with speaker localization for advanced user interfaces, i.e. using video information to direct a phased-array microphone configuration [8].
∗ This research was supported by the Dutch Science Foundation NWO under grant 634.000.432 within the ToKeN2000 program.
¹ We thank Sound Intelligence (http://www.soundintel.com) for contributing this model and cooperation on the aggression detection methods.

2 System Description

2.1 Audio Unit

Audio processing is performed in the time-frequency domain with an approach common in auditory scene analysis [11]. The transformation of the time signal to the time-frequency domain is performed by a model¹ of the human ear [3]. This model is a transmission-line model with its channels tuned according to the oscillatory properties of the basilar membrane. Leaky integration of the squared membrane displacement results in an energy spectrum, called a cochleogram (see Fig. 1).

A signal component is defined as a coherent area of the cochleogram that is very likely to stem from a single source. To obtain signal components, the cochleogram is first filtered with a matched filter which encodes the response of the cochlea to a perfect sinusoid of the frequency applicable to that segment. Then the cochleogram is thresholded using a cut-off value based on two times the standard deviation of the energy values. Signal components are obtained as the tracks formed by McAulay-Quatieri tracking [7], applied to this pre-processed version of the cochleogram. This entails stringing the energy maxima of connected components together over the successive frames.
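For concreteness, the thresholding-and-tracking step could look as follows. This is a minimal sketch, assuming the cochleogram is a channels × frames numpy array; the matched-filter pre-processing is omitted, and the greedy peak matcher is an illustrative stand-in, not the authors' implementation:

```python
import numpy as np

def extract_signal_components(cochleogram, max_jump=2):
    """Threshold the cochleogram at twice the std of the energy values,
    then string per-frame local energy maxima into tracks over successive
    frames, McAulay-Quatieri style [7]. Returns [(frame, channel), ...] tracks."""
    n_chan, n_frames = cochleogram.shape           # channels x time frames
    mask = cochleogram > 2.0 * np.std(cochleogram)
    tracks, active = [], {}                        # active: last channel -> track
    for t in range(n_frames):
        col = np.where(mask[:, t], cochleogram[:, t], 0.0)
        # local energy maxima across channels in this frame
        peaks = [c for c in range(1, n_chan - 1)
                 if col[c] > 0 and col[c] >= col[c - 1] and col[c] >= col[c + 1]]
        survivors = {}
        for c in peaks:
            near = [pc for pc in active if abs(pc - c) <= max_jump]
            if near:                               # continue the nearest track
                track = active.pop(min(near, key=lambda pc: abs(pc - c)))
            else:                                  # birth of a new component
                track = []
            track.append((t, c))
            survivors[c] = track
        tracks.extend(active.values())             # unmatched tracks are closed
        active = survivors
    tracks.extend(active.values())
    return tracks
```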
Figure 1: Typical cochleograms of aggressive and normal speech (top and bottom figure, respectively). Energy content is color-coded (increasing from blue to red). Note the higher pitch (marked by white lines) and the more pronounced higher harmonics for the aggressive speech case.

Co-developing sinusoids with a frequency development equal to an integer multiple of a fundamental frequency (harmonics) are subsequently combined into harmonic complexes. These harmonics can be combined safely because the probability is small that uncorrelated sound sources show this measure of correlation by chance.

Little or no literature exists on the influence of aggression on the properties of speech. However, the Component Process Model of Scherer [9] and similarities with the Lombard reflex [6] suggest a couple of important cues for aggression. The component process theory assumes that anger and panic, emotions strongly related to aggression, manifest as ergotropic arousal. This form of arousal is accompanied by an increase in heart rate, blood pressure, transpiration and associated hormonal activity. The predictions given by the model show many similarities with the Lombard reflex: increased tension of the vocal chords raises the pitch and enhances the higher harmonics, which leads to an increase in spectral tilt. These two properties, pitch (the fundamental frequency f0) and spectral tilt (a measure of the slope of the average energy distribution, calculated as the energy of the harmonics above 500 Hz divided by the energy of the harmonics below 500 Hz), are calculated from the harmonic complexes. The audio detector uses these two properties as input for a decision tree. An example of normal and aggressive speech can be seen in Fig. 1.
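The two cues reduce to a few lines of code. A minimal sketch, assuming a harmonic complex is available as a list of (frequency in Hz, energy) pairs; the final rule and its threshold values are illustrative placeholders for the trained decision tree, not numbers from the paper:

```python
def audio_cues(harmonic_complex):
    """Compute pitch (f0) and spectral tilt for one harmonic complex,
    given as a list of (frequency_hz, energy) pairs (assumed layout)."""
    f0 = min(f for f, _ in harmonic_complex)                # fundamental frequency
    high = sum(e for f, e in harmonic_complex if f > 500.0)  # energy above 500 Hz
    low = sum(e for f, e in harmonic_complex if f <= 500.0)  # energy below 500 Hz
    tilt = high / low if low > 0 else float("inf")           # spectral tilt
    return f0, tilt

def detect_aggressive_speech(f0, tilt, f0_thr=250.0, tilt_thr=1.0):
    # Stand-in for the trained decision tree: raised pitch AND raised tilt.
    # The threshold values are assumptions of this example.
    return f0 > f0_thr and tilt > tilt_thr
```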
2.2 Video Unit

Analysis of the video stream aims primarily at computing visual cues characteristic of physical aggression among humans. Physical aggression is usually characterized by fast articulation of body parts (i.e. arms, legs). Therefore, a principled approach for detecting aggression involves detailed body-pose estimation, possibly in 3D, followed by ballistic analysis of the movements of body parts. Unfortunately, at present pose estimation remains a significant computational challenge. Various approaches [4] operate at limited rates and handle mostly a single person in a constrained setting (limited occlusions, pre-fitted body model).

Simplified approaches rely on a coarser representation of the human body. An example is a system [2] that tracks the head of a person by analyzing the body contour and correlates aggression with the head's "jerk" (derivative of acceleration). In practice, such high-order derivatives related to body contours are difficult to estimate robustly when the background is not static and there is a possibility of occlusion.

2.2.1 Visual aggression features

Here we consider alternative cues, based on the intuitive observation that aggressive behavior leads to highly energetic body articulation. We estimate the (pseudo-)kinetic energy of body parts using a "bag-of-points" body model. The approach relies on simple image processing operations and yields features highly correlated with aggression.

For detecting people, our video subsystem employs an adaptive background/foreground subtraction technique [12]. The assumption of a static background scene holds fairly well in the center view area, where most of the people enter the scene. After detection, people are represented as ellipses and tracked with an extended version of the mean-shift tracking algorithm [13]. The extended tracker adapts the position and shape of the ellipse (tilt, axes) and thus facilitates a close approximation of the body area, even for tilted or bent poses. Additionally, the mean-shift tracker handles partial occlusions well and achieves near real-time performance.
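A rough sketch of this detection-and-tracking front end, assuming OpenCV: the MOG2 subtractor implements Zivkovic's adaptive Gaussian mixture model [12], while CamShift serves here only as a stand-in for the extended mean-shift tracker of [13], since it likewise adapts the window's position, size and tilt. The input clip and initial box are hypothetical:

```python
import cv2

cap = cv2.VideoCapture("platform.avi")              # hypothetical input clip
subtractor = cv2.createBackgroundSubtractorMOG2()   # adaptive bg/fg subtraction [12]
window = (100, 50, 80, 160)                         # initial person box (assumed)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)               # foreground probability mask
    # rotated_box approximates the tilted ellipse representing the person
    rotated_box, window = cv2.CamShift(fg_mask, window, term)
```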
We consider the human body as a collection of loosely connected points with identical mass. While such a model is clearly a simplification, it reflects well the non-rigid nature of a body and facilitates fast computations. Assuming Q points attached to various body parts, the average kinetic energy of an articulating body is given by the average kinetic energy of the points,

    E = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{2} m_i |v_i - v_e|^2        (1)

where v_i and m_i denote, respectively, the velocity vector and mass of the ith point, and v_e denotes the velocity vector of the ellipse representing the person. By discounting the overall body motion, we capture only the articulation energy.

Due to the assumption of a uniform mass distribution between the points, the total body mass becomes a scale factor. By omitting scale factors, we obtain a pseudo-kinetic energy estimate of the form \bar{E} = \frac{1}{Q} \sum_{i=1}^{Q} |v_i - v_e|^2. Such features are assumed to measure articulation and will be our primary visual cues for aggression detection.

Figure 2: (Left) Optical-flow features for detecting trains in motion. (Right) Representing people: ellipses (for tracking) and points (for articulation features).

Computation of the energy features requires selecting image points that represent a person. Ideally, the points would cover the limbs and the head, since these parts are the most articulated. Further, to estimate velocities, the selected points must be easy to track. Accordingly, we select Q = 100 points within an extended bounding box of a person by finding the pixels with the most local contrast [10]. Such points are easy to track and usually align well with edges in the image (which in turn often coincide with limbs, as in Fig. 2, right). For point tracking we use the KLT algorithm [10] (a freely available implementation exists in the OpenCV library).
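A compact sketch of the per-person measurement, using OpenCV's corner selection and pyramidal KLT tracker [10] and ending with the pseudo-kinetic energy \bar{E} defined above. The function and argument names (other than the OpenCV calls) are assumptions of this example:

```python
import cv2
import numpy as np

def articulation_energy(prev_gray, gray, box, ellipse_velocity, q=100, fps=20.0):
    """Select up to q high-contrast points in the person's (extended)
    bounding box, track them to the next frame, and return the
    pseudo-kinetic energy: mean squared point velocity relative to the
    overall (ellipse) motion, i.e. Eq. (1) without mass/scale factors."""
    x, y, w, h = box
    roi = np.zeros_like(prev_gray)
    roi[y:y + h, x:x + w] = 255                    # restrict point selection
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=q,
                                  qualityLevel=0.01, minDistance=3, mask=roi)
    if pts is None:
        return 0.0
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    ok = status.ravel() == 1
    v = (nxt[ok] - pts[ok]).reshape(-1, 2) * fps   # per-point velocity (px/s)
    ve = np.asarray(ellipse_velocity)              # body (ellipse) velocity
    return float(np.mean(np.sum((v - ve) ** 2, axis=1)))
```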
2.2.2 Train detection

An additional objective of the video unit is detecting trains in motion. Trains moving in and out of a station produce auditory noise that often leads to spurious audio-aggression detections. Recognizing trains in the video therefore opens the possibility of suppressing such detections later on.

A train usually appears as a large, rigid body and moves along a constrained trajectory. For a given view and rail section, we define a mask that indicates the image regions where a train typically appears. In this region we track the frame-to-frame motion of N = 100 image features with the KLT tracker [10] (Fig. 2, left). The features' motion vectors are classified as train/non-train by a pre-trained nearest-neighbor classifier. A train in motion is detected when more than 50% of the features are classified positively. Due to the constrained movement of trains, our detector turns out to be quite robust to occasional occlusions of the train area by people.
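A minimal sketch of this voting rule; the exemplar arrays stand for the nearest-neighbor classifier's pre-trained training vectors and are assumptions of this example:

```python
import numpy as np

def train_in_motion(motion_vectors, train_exemplars, other_exemplars):
    """Label each feature's frame-to-frame motion vector train/non-train
    by a 1-nearest-neighbor rule against exemplar sets (assumed (K, 2)
    arrays), and report a train when more than 50% vote positive."""
    votes = 0
    for v in np.asarray(motion_vectors):            # v = (dx, dy) of one feature
        d_train = np.linalg.norm(train_exemplars - v, axis=1).min()
        d_other = np.linalg.norm(other_exemplars - v, axis=1).min()
        votes += d_train < d_other
    return votes > 0.5 * len(motion_vectors)
```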
2.3 Fusion Unit

The fusion unit produces an aggregate aggression indication given the features/events produced independently by the audio and video subsystems. A fundamental difficulty with fusion arises from the inevitable ambiguities in human behavior, which make it difficult to separate normal from aggressive activities (even for a human observer). Additional problems follow from various noise artifacts in the sensory data. Given the noisy and ambiguous domain, we resort to a probabilistic formulation. The fusion unit employs a probabilistic time-series model (a Dynamic Bayesian Network, DBN [5]), where the aggression level can be estimated in a principled way by solving an appropriate inference problem.

2.3.1 Basic model

We denote the discrete-time index as k = 1, 2, ..., and set the gap between discrete-time steps (clock ticks) to 50 ms. At the kth step, y^a_k \in \{0,1\} denotes the output of the audio aggression detector, and y^v_{j,k} denotes the pseudo-kinetic energy of the jth person, j = 1, ..., J. Our system can comprise several train detectors monitoring non-overlapping rail sections. The binary output of the mth train detector, m = 1, ..., M, will be denoted as y^T_{m,k} \in \{0,1\}. (We tested a configuration with M = 4.)

Our aim is to detect "ambient" scene aggression, without deciding precisely which persons are aggressive. Therefore we reason on the basis of a cumulative articulation measurement y^v_k = \sum_j y^v_{j,k} over all persons. Additionally, the cumulative measurement is quite robust to (near-)occlusions, where the articulation of one person could be wrongly attributed to another.

In order to reason about the aggression level, we use a discrete scale on [0, 1] with six levels in steps of 0.2: 0.0 (no activity), 0.2 (normal activity), 0.4 (attention required), 0.6 (minor disturbance), 0.8 (major disturbance), and 1.0 (critical aggression).

Importantly, the aggression level obeys specific correlations over time and should be represented as a process (rather than an instantaneous quantity). We denote the aggression level at step k as a_k and define a stochastic process \{a_k\} with dynamics given by a 1st-order model:

    p(a_{k+1} = i \mid a_k = j) = \mathrm{CPT}_a(i, j),        (2)

where \mathrm{CPT}_a(i, j) denotes a conditional probability table. In a sense, the first-order model is a simplification, as it captures only short-term dependencies.

The measured visual (y^v_k) and auditory (y^a_k) features are treated as samples from an observation distribution (model) that depends on the aggression level a_k. Since (later on) we will incorporate information about passing trains, we introduce a latent train-noise indicator variable n_k \in \{0,1\} and assume that the observation model

    p(y^v_k, y^a_k \mid a_k, n_k)

depends also on the train-noise indicator. The model takes the form of a conditional probability table \mathrm{CPT}_o, where the cumulative articulation feature is discretized.
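As an illustration, the basic-model quantities could be laid out as below. The six levels follow the paper's scale; the table entries are placeholders standing in for the maximum-likelihood estimates described in Sect. 2.3.3:

```python
import numpy as np

LEVELS = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])   # the discrete aggression scale

# CPT_a(i, j) = p(a_{k+1} = i | a_k = j), Eq. (2): columns sum to one and
# most mass sits on the diagonal, so the level evolves slowly per 50 ms tick.
CPT_a = np.full((6, 6), 0.02)
np.fill_diagonal(CPT_a, 0.90)
CPT_a /= CPT_a.sum(axis=0, keepdims=True)

def predict_step(p_a):
    """Propagate the aggression-level distribution one clock tick."""
    return CPT_a @ p_a
```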
2.3.2 Train models

The fusion DBN comprises several subnetworks: train models, which couple the train detections y^T_{m,k} with the latent train-noise indicator n_k. Additionally, each train model encodes prior information about the duration of a train pass.

For the mth rail section, we introduce a latent indicator i_{m,k} \in \{0,1\} of a train passing at step k. We assume that the train detections y^T_{m,k}, the train-pass indicators i_{m,k}, and the train noise n_k obey a probabilistic relation

    p(y^T_{m,k} \mid i_{m,k}) = \mathrm{CPT}_t(y^T_{m,k}, i_{m,k})        (3)
    p(n_k \mid i_{1:M,k}) = \mathrm{CPT}_n(n_k, i_{1:M,k}).        (4)

For each rail, the model (3) encodes the inaccuracies of the detector (mis-detections, false alarms). The model (4) represents the fact that passing trains usually induce noise, but also that sometimes noise is present without a passing train.

Since a typical pass takes 5-10 seconds (100-200 steps), the pass indicator variable exhibits strong temporal correlations. We represent such correlations with a time-series model based on a gamma distribution. A gamma pdf \gamma(\tau_m; \alpha_m, \beta_m) is a convenient choice for modeling the duration \tau_m of an event (\alpha_m, \beta_m are parameters). To apply this model in a time-series formulation, we replace the total duration \tau_m with a partial duration \tau_{m,k} that indicates how long a train has already been passing the scene at step k.

By considering a joint process \{i_{m,k}, \tau_{m,k}\}, temporal correlations can be enforced by the following model:

    p(i_{m,k+1} = 1 \mid \tau_{m,k}, i_{m,k} = 0) = \eta_m
    p(i_{m,k+1} = 1 \mid \tau_{m,k}, i_{m,k} = 1) = p(\tau_m > \tau_{m,k})
        = \int_{\tau_{m,k}}^{+\infty} \gamma(\tau_m; \alpha_m, \beta_m) \, d\tau_m = 1 - F(\tau_{m,k}; \alpha_m, \beta_m),

where F() is the gamma cumulative distribution function. The parameter \eta_m denotes the probability of starting a new train pass. At the kth step, the probability of continuing a pass is a function of the current duration of the pass. A configuration (i_{m,k+1} = 1, \tau_{m,k}, i_{m,k} = 1) implies that a pass has not finished yet and that the total pass duration will be larger than \tau_{m,k}, hence the integration. Further, the partial duration variable obeys a deterministic regime

    \tau_{m,k+1} = \begin{cases} 0 & \text{if } i_{m,k+1} = 0 \\ \tau_{m,k} + \epsilon & \text{otherwise,} \end{cases}

where \epsilon = 50 ms is the period between successive steps.

Figure 3: Dynamic Bayesian Network representing the probabilistic fusion model. The rectangular plate indicates M = 4 replications of the train sub-network.
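In code, the continuation probability is just the gamma survival function. A one-line sketch; whether \beta_m acts as a rate or a scale parameter is not stated in the paper, so a scale parameter is assumed here:

```python
from scipy.stats import gamma

def p_continue_pass(tau_mk, alpha_m, beta_m):
    """p(tau_m > tau_mk) = 1 - F(tau_mk; alpha_m, beta_m): the probability
    that the mth train pass survives beyond its current partial duration."""
    return gamma.sf(tau_mk, a=alpha_m, scale=beta_m)   # sf = 1 - cdf
```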
                                                                       ing the observation model CPTo . An increased probability
2.3.3 Inference and Learning                                           of a false auditory aggression in the presence of train noise,
                                                                       will suppress the contribution of audio data to the aggres-
In a probabilistic framework, reasoning about aggression
                                                                       sion level when the video subsystem reports a passing train.
corresponds to solving probabilistic inference problems. In
an online mode, the key quantity of interest is the posterior
distribution on aggression level given data collected at up            3 Experiments
                            v      a      t
to the current step, p(ak |y1:k , y1:k , y1:m,1:k ). From this dis-
tribution we calculate the expected aggression value, which            We evaluate the aggression detection system using a set of
will be the basic output of the fusion system.                         13 audio-video clips (scenarios) recorded at a train station.




                                                                 203
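The 10 s running-average filter above amounts to a single convolution at 50 ms per step (200 steps); the edge handling below is a choice of this sketch:

```python
import numpy as np

def low_pass(x, window=200):
    """10 s running average at 50 ms per step, applied to the articulation
    measurements and to the expected aggression level."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="same")
```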
3 Experiments

We evaluate the aggression detection system using a set of 13 audio-video clips (scenarios) recorded at a train station. The clips (each 100 s to 150 s long) feature 2-4 professional actors who engage in a variety of activities, ranging from normal (walking), through slightly excited (shouting, running, hugging) and moderately aggressive (pushing, hitting a vending machine), to critically aggressive (football supporters clashing). The recording took place on a platform of an actual train station (between two rail tracks, partially outdoor) and therefore incorporates realistic artifacts, like noise and vibrations from trains, variable illumination, wind, etc.

Figure 4: A screen-shot of the CASSANDRA prototype system. In the lower part of the right window, various intermediate quantities are shown: the probabilities for trains on the various tracks (currently zero), the output of the video-based articulation energy ("People", currently mid-way), and the output of the audio detector (currently at maximum). The slider on the right side shows the overall estimated aggression level (currently "major disturbance").

Scenarios have been manually annotated with a ground-truth aggression level by two independent observers, using the scale mentioned in Sect. 2.3.1. Aggression toward objects was rated approx. 25% lower than aggression toward humans, i.e. the former did not exceed a level of 0.8.


Fig. 5 details the results of the CASSANDRA system on a scenario involving a gradual aggression build-up. Here, two pairs of competing supporters first start arguing, then get into a fight. The bottom panel shows some illustrative frames, with the articulation features highlighted. We see from Fig. 5 that the raw (before low-pass filtering) articulation measurements are rather noisy; however, the low-pass filtering reveals a strong correlation with the ground-truth aggression level. The effect of low-pass filtering of the estimated aggression level is shown in the bottom plot of Fig. 5. A screen-shot of the CASSANDRA system in action is shown in Fig. 4.

Figure 5: Aggression build-up scenario. (Top graph) Audio-aggression detections. (Middle graph) Visual articulation measurements. (Bottom graph) Estimated and ground-truth aggression level. The gray lines show uncertainty intervals (2× std. deviation) around the raw (before filtering) expected level. (Images) Several frames (timestamps: 45 s, 77 s, 91 s, 95 s). Notice the correspondence with the articulation measurements.

We considered two quantitative criteria to evaluate our system. The first is the deviation of the CASSANDRA-estimated aggression level from the ground-truth annotation. Here we obtained a mean deviation of 0.17, with standard deviation 0.1. The second performance criterion considers aggression detection as a two-class classification problem of distinguishing between "normal" and "aggressive" events (by thresholding the aggression level at 0.5). Matching the ground-truth with the estimated events allows us to compute the detection rate (%) and the false-alarm rate (per hour). When matching events we allowed a time deviation of 10 s.
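A sketch of this matching step; the greedy nearest-match pairing is an assumption of this example, as the paper does not specify the pairing rule:

```python
def match_events(gt_times, det_times, tol=10.0):
    """Pair each ground-truth aggressive event with the closest unused
    detection within the 10 s tolerance; paired detections count as true
    positives, the remainder as false alarms."""
    used, true_pos = set(), 0
    for g in gt_times:
        cands = [i for i, d in enumerate(det_times)
                 if i not in used and abs(d - g) <= tol]
        if cands:
            used.add(min(cands, key=lambda i: abs(det_times[i] - g)))
            true_pos += 1
    return true_pos, len(det_times) - len(used)    # (true-pos., false-pos.)
```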
Figure 7: Cumulative detection results by sensor modality (detection rate, in %, versus false positives per hour, for the audio, video and fusion configurations).

id   scenario content                      positive   true-pos.   false-pos.
 1   normal: walking, greeting                0           0            0
 2   normal: walking, greeting                0           0            1
 3   excited: lively argument                 0           0            0
 4   excited: lively argument                 1           1            0
 5   aggression toward a vend. machine        1           0            1
 6   aggression toward a vend. machine        1           0            0
 7   happy football supporters                1           1            0
 8   happy football supporters                0           0            1
 9   supporters harassing a passenger         1           1            0
10   supporters harassing a passenger         1           1            0
11   two people fight, third intervenes       1           1            0
12   four people fighting                     1           1            0
13   four people fighting                     1           1            1

Table 1: Aggression detection results by scenario. "positive" gives the number of ground-truth aggressive events; "true-pos." and "false-pos." give the detected events. Figure 6 shows example frames from the 5th scenario.

Figure 6: Selected frames from a scenario involving aggression toward a machine.

The cumulative results of leave-one-out tests on the 13 scenarios (12 for training, 1 for testing) are given in Fig. 7. Comparing the test results for the three modalities (audio, video, fusion of audio+video), we notice that the auditory and visual features are indeed complementary; with fusion, the overall detection rate increased without introducing additional false alarms. It is important to note that our data set is heavily biased toward occurrences of aggression, which puts the system to a difficult test. We expect CASSANDRA to produce far fewer false alarms in a typical surveillance setting, where most of the time nothing happens.

Table 1 gives an overview of the detection results on the scenarios. We notice that the system performed well on the clearly normal cases (scenarios 1-3) and the clearly aggressive cases (scenarios 9-13), while borderline scenarios were more difficult to classify. The borderline behavior (e.g. scenarios 7-8) turns out to be difficult to classify for human observers as well, given the inconsistent ground-truth annotation in Tab. 1.

The CASSANDRA system runs on two PCs (one with the video and fusion units, the other with the audio unit). The overall processing rate is approx. 5 Hz, with a 756×560 pixel, 20 Hz input video stream and a 44 kHz input audio stream.

4 Conclusions and Future Work

We demonstrated a prototype system that uses a Dynamic Bayesian Network to fuse auditory and visual information for detecting aggressive behavior. On the auditory side, the system relies on scream-like cues; on the video side, the system uses motion features related to articulation. We obtained a promising aggression detection performance in a complex, real-world train station setting, operating in near-realtime.

The present system is able to distinguish well between clear cases of aggression and normal behavior. In the future, we plan to focus increasingly on the "borderline" cases. For this, we expect to use more elaborate auditory cues (laughter vs. scream), more detailed visual cues (indications of body contact, partial body-pose estimation), and stronger use of context information.

References

[1] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI, 1998.
[2] A. Datta et al. Person-on-person violence detection in video data. In ICPR, 2002.
[3] H. Duifhuis et al. Cochlear Mechanisms: Structure, Function and Models, pages 395-404. 1985.
[4] D.M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82-98, 1999.
[5] Z. Ghahramani. An introduction to hidden Markov models and Bayesian networks. IJPRAI, 15(1):9-42, 2001.
[6] J.-C. Junqua. The Lombard reflex and its role on human listeners and automatic speech recognizers. The Journal of the Acoustical Society of America, 93(1):510-524, 1993.
[7] R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoustics, Speech, and Signal Processing, 34(4):744-754, 1986.
[8] A. Pentland. Smart rooms. Scientific American, 274(4):54-62, 1996.
[9] K. Scherer. Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99(2):143-165, 1986.
[10] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[11] D. Wang and G. Brown. Computational Auditory Scene Analysis. John Wiley and Sons, 2006.
[12] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In ICPR, 2004.
[13] Z. Zivkovic and B. Krose. An EM-like algorithm for color-histogram-based object tracking. In CVPR, 2004.

Más contenido relacionado

La actualidad más candente

Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...
Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...
Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...Wanmin Wu
 
Convolutional neural networks
Convolutional neural networksConvolutional neural networks
Convolutional neural networksSlobodan Blazeski
 
A New Approach for video denoising and enhancement using optical flow Estimation
A New Approach for video denoising and enhancement using optical flow EstimationA New Approach for video denoising and enhancement using optical flow Estimation
A New Approach for video denoising and enhancement using optical flow EstimationIRJET Journal
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSREHMAT ULLAH
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsSangmin Woo
 
A Deep Belief Network Approach to Learning Depth from Optical Flow
A Deep Belief Network Approach to Learning Depth from Optical FlowA Deep Belief Network Approach to Learning Depth from Optical Flow
A Deep Belief Network Approach to Learning Depth from Optical FlowReuben Feinman
 
Convolutional neural networks
Convolutional neural networksConvolutional neural networks
Convolutional neural networksAshutosh Kumar
 
Erase icitst
Erase icitstErase icitst
Erase icitstnashvasan
 
On the Development of a Brain Simulator
On the Development of a Brain SimulatorOn the Development of a Brain Simulator
On the Development of a Brain SimulatorJimmy Lu
 
Fcv learn sudderth
Fcv learn sudderthFcv learn sudderth
Fcv learn sudderthzukun
 
Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...
Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...
Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...Kan Ouivirach, Ph.D.
 
Biometric simulator for visually impaired (1)
Biometric simulator for visually impaired (1)Biometric simulator for visually impaired (1)
Biometric simulator for visually impaired (1)Rahul Bhagat
 
Pure Coordination Using the Coordinator-Configurator Pattern
Pure Coordination Using the Coordinator-Configurator PatternPure Coordination Using the Coordinator-Configurator Pattern
Pure Coordination Using the Coordinator-Configurator PatternSerge Stinckwich
 
IRJET-AI Neural Network Disaster Recovery Cloud Operations Systems
IRJET-AI Neural Network Disaster Recovery Cloud Operations SystemsIRJET-AI Neural Network Disaster Recovery Cloud Operations Systems
IRJET-AI Neural Network Disaster Recovery Cloud Operations SystemsIRJET Journal
 
Application Note - Neuroscience: Peripheral Nerve Imaging
Application Note - Neuroscience: Peripheral Nerve ImagingApplication Note - Neuroscience: Peripheral Nerve Imaging
Application Note - Neuroscience: Peripheral Nerve ImagingFUJIFILM VisualSonics Inc.
 
Evolving Neural Network Agents In The Nero Video
Evolving Neural Network Agents In The Nero VideoEvolving Neural Network Agents In The Nero Video
Evolving Neural Network Agents In The Nero VideoStelios Petrakis
 
Single Image Super Resolution using Interpolation and Discrete Wavelet Transform
Single Image Super Resolution using Interpolation and Discrete Wavelet TransformSingle Image Super Resolution using Interpolation and Discrete Wavelet Transform
Single Image Super Resolution using Interpolation and Discrete Wavelet Transformijtsrd
 

La actualidad más candente (20)

Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...
Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...
Color-plus-Depth Level-of-Detail in 3D Tele-immersive Video: A Psychophysical...
 
Convolutional neural networks
Convolutional neural networksConvolutional neural networks
Convolutional neural networks
 
A New Approach for video denoising and enhancement using optical flow Estimation
A New Approach for video denoising and enhancement using optical flow EstimationA New Approach for video denoising and enhancement using optical flow Estimation
A New Approach for video denoising and enhancement using optical flow Estimation
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKS
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
 
A Deep Belief Network Approach to Learning Depth from Optical Flow
A Deep Belief Network Approach to Learning Depth from Optical FlowA Deep Belief Network Approach to Learning Depth from Optical Flow
A Deep Belief Network Approach to Learning Depth from Optical Flow
 
Convolutional neural networks
Convolutional neural networksConvolutional neural networks
Convolutional neural networks
 
Erase icitst
Erase icitstErase icitst
Erase icitst
 
K0966468
K0966468K0966468
K0966468
 
On the Development of a Brain Simulator
On the Development of a Brain SimulatorOn the Development of a Brain Simulator
On the Development of a Brain Simulator
 
Fcv learn sudderth
Fcv learn sudderthFcv learn sudderth
Fcv learn sudderth
 
Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...
Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...
Clustering Human Behaviors with Dynamic Time Warping and Hidden Markov Models...
 
Biometric simulator for visually impaired (1)
Biometric simulator for visually impaired (1)Biometric simulator for visually impaired (1)
Biometric simulator for visually impaired (1)
 
Matlab abstract 2016
Matlab abstract 2016Matlab abstract 2016
Matlab abstract 2016
 
Pure Coordination Using the Coordinator-Configurator Pattern
Pure Coordination Using the Coordinator-Configurator PatternPure Coordination Using the Coordinator-Configurator Pattern
Pure Coordination Using the Coordinator-Configurator Pattern
 
A2 l
A2 lA2 l
A2 l
 
IRJET-AI Neural Network Disaster Recovery Cloud Operations Systems
IRJET-AI Neural Network Disaster Recovery Cloud Operations SystemsIRJET-AI Neural Network Disaster Recovery Cloud Operations Systems
IRJET-AI Neural Network Disaster Recovery Cloud Operations Systems
 
Application Note - Neuroscience: Peripheral Nerve Imaging
Application Note - Neuroscience: Peripheral Nerve ImagingApplication Note - Neuroscience: Peripheral Nerve Imaging
Application Note - Neuroscience: Peripheral Nerve Imaging
 
Evolving Neural Network Agents In The Nero Video
Evolving Neural Network Agents In The Nero VideoEvolving Neural Network Agents In The Nero Video
Evolving Neural Network Agents In The Nero Video
 
Single Image Super Resolution using Interpolation and Discrete Wavelet Transform
Single Image Super Resolution using Interpolation and Discrete Wavelet TransformSingle Image Super Resolution using Interpolation and Discrete Wavelet Transform
Single Image Super Resolution using Interpolation and Discrete Wavelet Transform
 

Similar a Cassandra audio-video sensor fusion for aggression detection

A Fuzzy Watermarking Approach Based on Human Visual System
A Fuzzy Watermarking Approach Based on Human Visual SystemA Fuzzy Watermarking Approach Based on Human Visual System
A Fuzzy Watermarking Approach Based on Human Visual SystemCSCJournals
 
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural Tasks
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural TasksRyan Match Moving For Area Based Analysis Of Eye Movements In Natural Tasks
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural TasksKalle
 
Bt9403 virtual reality
Bt9403 virtual realityBt9403 virtual reality
Bt9403 virtual realitysmumbahelp
 
Wireless Sensor Network using Particle Swarm Optimization
Wireless Sensor Network using Particle Swarm OptimizationWireless Sensor Network using Particle Swarm Optimization
Wireless Sensor Network using Particle Swarm Optimizationidescitation
 
Liu Natural Scene Statistics At Stereo Fixations
Liu Natural Scene Statistics At Stereo FixationsLiu Natural Scene Statistics At Stereo Fixations
Liu Natural Scene Statistics At Stereo FixationsKalle
 
Action Recognition using Nonnegative Action
Action Recognition using Nonnegative ActionAction Recognition using Nonnegative Action
Action Recognition using Nonnegative Actionsuthi
 
Human activity recognition updated 1 - Copy.pptx
Human activity recognition updated 1 - Copy.pptxHuman activity recognition updated 1 - Copy.pptx
Human activity recognition updated 1 - Copy.pptxBhaveshKhuje
 
Purkinje imaging for crystalline lens density measurement
Purkinje imaging for crystalline lens density measurementPurkinje imaging for crystalline lens density measurement
Purkinje imaging for crystalline lens density measurementPetteriTeikariPhD
 
Resource aware and reliable cluster based communication scheme for wireless b...
Resource aware and reliable cluster based communication scheme for wireless b...Resource aware and reliable cluster based communication scheme for wireless b...
Resource aware and reliable cluster based communication scheme for wireless b...IAEME Publication
 
19 3 sep17 21may 6657 t269 revised (edit ndit)
19 3 sep17 21may 6657 t269 revised (edit ndit)19 3 sep17 21may 6657 t269 revised (edit ndit)
19 3 sep17 21may 6657 t269 revised (edit ndit)IAESIJEECS
 
19 3 sep17 21may 6657 t269 revised (edit ndit)
19 3 sep17 21may 6657 t269 revised (edit ndit)19 3 sep17 21may 6657 t269 revised (edit ndit)
19 3 sep17 21may 6657 t269 revised (edit ndit)IAESIJEECS
 
Multi sensor data fusion system for enhanced analysis of deterioration in con...
Multi sensor data fusion system for enhanced analysis of deterioration in con...Multi sensor data fusion system for enhanced analysis of deterioration in con...
Multi sensor data fusion system for enhanced analysis of deterioration in con...Sayed Abulhasan Quadri
 
Analytical Review on the Correlation between Ai and Neuroscience
Analytical Review on the Correlation between Ai and NeuroscienceAnalytical Review on the Correlation between Ai and Neuroscience
Analytical Review on the Correlation between Ai and NeuroscienceIOSR Journals
 
IRJET- Survey on Detection of Crime
IRJET-  	  Survey on Detection of CrimeIRJET-  	  Survey on Detection of Crime
IRJET- Survey on Detection of CrimeIRJET Journal
 
26.motion and feature based person tracking
26.motion and feature based person tracking26.motion and feature based person tracking
26.motion and feature based person trackingsajit1975
 
Maximizing Strength of Digital Watermarks Using Fuzzy Logic
Maximizing Strength of Digital Watermarks Using Fuzzy LogicMaximizing Strength of Digital Watermarks Using Fuzzy Logic
Maximizing Strength of Digital Watermarks Using Fuzzy Logicsipij
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 

Similar a Cassandra audio-video sensor fusion for aggression detection (20)

A Fuzzy Watermarking Approach Based on Human Visual System
A Fuzzy Watermarking Approach Based on Human Visual SystemA Fuzzy Watermarking Approach Based on Human Visual System
A Fuzzy Watermarking Approach Based on Human Visual System
 
Image recognition
Image recognitionImage recognition
Image recognition
 
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural Tasks
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural TasksRyan Match Moving For Area Based Analysis Of Eye Movements In Natural Tasks
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural Tasks
 
Bt9403 virtual reality
Bt9403 virtual realityBt9403 virtual reality
Bt9403 virtual reality
 
Wireless Sensor Network using Particle Swarm Optimization
Wireless Sensor Network using Particle Swarm OptimizationWireless Sensor Network using Particle Swarm Optimization
Wireless Sensor Network using Particle Swarm Optimization
 
Liu Natural Scene Statistics At Stereo Fixations
present the operator only the security-relevant data. Existing systems still have limited capabilities; they typically perform video-based intrusion detection, and possibly some trajectory analysis, in fairly static environments.

In the context of human activity recognition in dynamic environments, we focus on the relatively unexplored problem of aggression detection. Earlier work involved solely the video domain and considered fairly controlled indoor environments with a static background and few (two) persons [2]. Because events associated with the build-up or enactment of aggression are difficult to detect by a single sensor modality (e.g. shouting versus hitting someone), the system combines both sensing modalities, as outlined above.

A signal component is defined as a coherent area of the cochleogram that is very likely to stem from a single source. To obtain signal components, the cochleogram is first filtered with a matched filter which encodes the response of the cochlea to a perfect sinusoid of the frequency applicable to that segment. Then the cochleogram is thresholded using a cut-off value based on two times the standard deviation of the energy values. Signal components are obtained as the tracks formed by McAulay-Quatieri tracking [7], applied to this pre-processed version of the cochleogram. This entails stringing the energy maxima of connected components together over the successive frames.

∗ This research was supported by the Dutch Science Foundation NWO under grant 634.000.432 within the ToKeN2000 program.
1 We thank Sound Intelligence (http://www.soundintel.com) for contributing this model and cooperation on the aggression detection methods.
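A minimal sketch of this component-forming step, assuming the cochleogram is available as a 2-D NumPy array (rows = time frames, columns = frequency channels). The matched-filter stage is omitted, and the peak-continuation rule below is a simplified stand-in for McAulay-Quatieri tracking, not the paper's implementation:

```python
import numpy as np

def signal_components(cochleogram, max_channel_jump=2):
    """Threshold a cochleogram and string energy maxima over frames.

    cochleogram: array of shape (n_frames, n_channels) with energy values.
    Returns a list of tracks, each a list of (frame, channel) pairs.
    """
    # Cut-off at two times the standard deviation of the energy values.
    mask = cochleogram > 2.0 * cochleogram.std()

    tracks, active = [], []
    for t, (frame, keep) in enumerate(zip(cochleogram, mask)):
        # Local energy maxima among the above-threshold channels.
        peaks = [c for c in range(1, len(frame) - 1)
                 if keep[c] and frame[c] >= frame[c - 1] and frame[c] >= frame[c + 1]]
        next_active = []
        for p in peaks:
            # Continue the nearest active track, else start a new one.
            cont = min(active, key=lambda tr: abs(tr[-1][1] - p), default=None)
            if cont is not None and abs(cont[-1][1] - p) <= max_channel_jump:
                active.remove(cont)
                cont.append((t, p))
                next_active.append(cont)
            else:
                tr = [(t, p)]
                tracks.append(tr)
                next_active.append(tr)
        active = next_active
    return tracks
```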
Codeveloping sinusoids with a frequency development equal to an integer multiple of a fundamental frequency (harmonics) are subsequently combined into harmonic complexes. Note that these harmonics can be combined safely because the probability is small that uncorrelated sound sources show this measure of correlation by chance.

Little or no literature exists on the influence of aggression on the properties of speech. However, the Component Process Model from Scherer [9] and similarities with the Lombard reflex [6] suggest a couple of important cues for aggression. The component process theory assumes that anger and panic, emotions strongly related to aggression, are seen as an ergotropic arousal. This form of arousal is accompanied by an increase in heart frequency, blood pressure, transpiration and associated hormonal activity. The predictions given by the model show many similarities with the Lombard reflex: tension of the vocal chords increases the pitch and enhances the higher harmonics, which leads to an increase in spectral tilt. These properties, pitch (the fundamental frequency f0) and spectral tilt (a measure of the slope of the average energy distribution, calculated as the energy of the harmonics above 500 Hz divided by the energy of the harmonics below 500 Hz), are calculated from the harmonic complexes. The audio detector uses these two properties as input for a decision tree. An example of normal and aggressive speech can be seen in Fig. 1.

[Figure 1: Typical cochleograms of aggressive and normal speech (top and bottom figure, respectively). Energy content is color-coded (increasing from blue to red). Note the higher pitch (marked by white lines) and the more pronounced higher harmonics for the aggressive speech case.]
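As an illustration, a sketch of how the two cues could be computed, assuming each harmonic complex arrives as a list of (frequency in Hz, energy) pairs ordered by frequency; the decision tree itself is replaced by a hypothetical thresholding stub with made-up thresholds:

```python
def pitch_and_tilt(harmonic_complex):
    """Pitch (f0) and spectral tilt for one harmonic complex.

    harmonic_complex: list of (frequency_hz, energy) pairs, one per
    harmonic, ordered by frequency; the first entry is the fundamental.
    """
    f0 = harmonic_complex[0][0]  # pitch = fundamental frequency f0
    high = sum(e for f, e in harmonic_complex if f > 500.0)
    low = sum(e for f, e in harmonic_complex if f <= 500.0)
    tilt = high / low if low > 0 else float("inf")  # energy >500 Hz / <=500 Hz
    return f0, tilt

def scream_like(f0, tilt, f0_thresh=300.0, tilt_thresh=1.0):
    # Stand-in for the paper's decision tree; both thresholds are made up.
    return f0 > f0_thresh and tilt > tilt_thresh
```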
2.2 Video Unit

Analysis of the video stream aims primarily at computing visual cues characteristic of physical aggression among humans. Physical aggression is usually characterized by fast articulation of body parts (i.e. arms, legs). Therefore, a principled approach for detecting aggression involves detailed body-pose estimation, possibly in 3D, followed by ballistic analysis of the movements of body parts. Unfortunately, at present pose estimation remains a significant computational challenge. Various approaches [4] operate at limited rates and handle mostly a single person in a constrained setting (limited occlusions, pre-fitted body model).

Simplified approaches rely on a coarser representation of the human body. An example is a system [2] that tracks the head of a person by analyzing the body contour and correlates aggression with the head's "jerk" (derivative of acceleration). In practice, such high-order derivatives related to body contours are difficult to estimate robustly in cases where the background is not static and there is a possibility of occlusion.

2.2.1 Visual aggression features

Here we consider alternative cues based on an intuitive observation: aggressive behavior leads to highly energetic body articulation. We estimate the (pseudo-)kinetic energy of body parts using a "bag-of-points" body model. The approach relies on simple image processing operations and yields features highly correlated with aggression.

For detecting people, our video subsystem employs an adaptive background/foreground subtraction technique [12]. The assumption of a static background scene holds fairly well in the center view area, where most of the people enter the scene. After detection, people are represented as ellipses and tracked with an extended version of the mean-shift tracking algorithm [13]. The extended tracker adapts the position and shape of the ellipse (tilt, axes) and thus facilitates a close approximation of the body area even for tilted/bent poses. Additionally, the mean-shift tracker handles partial occlusions well and achieves near real-time performance.

We consider the human body as a collection of loosely connected points with identical mass. While such a model is clearly a simplification, it reflects well the non-rigid nature of a body and facilitates fast computations. Assuming Q points attached to various body parts, the average kinetic energy of an articulating body is given by the average kinetic energy of the points,

$E = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{2} m_i |v_i - v_e|^2$    (1)

where $v_i$, $m_i$ denote, respectively, the velocity vector and mass of the ith point, and $v_e$ denotes the velocity vector of the ellipse representing the person. By discounting overall body motion we capture only the articulation energy.

Due to the assumption of uniform mass distribution between points, the total body mass becomes a scale factor. By omitting scale factors, we obtain a pseudo-kinetic energy estimate of the form $\bar{E} = \frac{1}{Q} \sum_{i=1}^{Q} |v_i - v_e|^2$. Such features are assumed to measure articulation and will be our primary visual cues for aggression detection.
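With the masses dropped, equation (1) reduces to a few lines of NumPy. A sketch assuming per-point velocity vectors from the point tracker and the ellipse velocity from the mean-shift tracker:

```python
import numpy as np

def pseudo_kinetic_energy(point_velocities, ellipse_velocity):
    """Pseudo-kinetic articulation energy: mean of |v_i - v_e|^2.

    point_velocities: (Q, 2) frame-to-frame velocities of the Q tracked
    body points; ellipse_velocity: (2,) velocity of the person's ellipse.
    """
    rel = point_velocities - ellipse_velocity  # discount overall body motion
    return float(np.mean(np.sum(rel ** 2, axis=1)))
```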
Computation of the energy features requires selecting image points that represent a person. Ideally, the points would cover the limbs and the head, since these parts are mostly articulated. Further, to estimate velocities, the selected points must be easy to track. Accordingly, we select Q = 100 points within an extended bounding box of a person by finding the pixels with the most local contrast [10]. Such points are easy to track and usually align well with edges in an image (which in turn often coincide with limbs, as in Fig. 2, right). For point tracking we use the KLT algorithm [10] (a freely available implementation from the OpenCV library).

[Figure 2: (Left) Optical-flow features for detecting trains in motion. (Right) Representing people: ellipses (for tracking) and points (for articulation features).]

2.2.2 Train detection

An additional objective of the video unit is detecting trains in motion. Trains moving in and out of a station produce auditory noise that often leads to spurious audio-aggression detections. Recognizing trains in video therefore opens the possibility of later suppressing such detections.

A train usually appears as a large, rigid body and moves along a constrained trajectory. For a given view and rail section we define a mask that indicates the image regions where a train typically appears. In this region we track the frame-to-frame motion of N = 100 image features with a KLT [10] tracker (Fig. 2, left). The features' motion vectors are classified as train/non-train by a pre-trained nearest-neighbor classifier. A train in motion is detected when more than 50% of the features are classified positively. Due to the constrained movement of trains, our detector turns out to be quite robust to occasional occlusions of the train area by people.
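A sketch of this decision rule, assuming the per-feature motion vectors inside the rail mask are compared against a hypothetical set of labeled prototype motion vectors standing in for the paper's pre-trained nearest-neighbor classifier:

```python
import numpy as np

def train_in_motion(motion_vectors, prototypes, labels):
    """Detect a passing train from feature motion inside the rail mask.

    motion_vectors: (N, 2) frame-to-frame motion of the tracked features.
    prototypes: (P, 2) labeled training motion vectors (hypothetical);
    labels: (P,) booleans, True = train-like motion.
    """
    votes = 0
    for v in motion_vectors:
        nearest = int(np.argmin(np.linalg.norm(prototypes - v, axis=1)))  # 1-NN
        votes += bool(labels[nearest])
    # A train is reported when more than 50% of the features vote "train".
    return votes > 0.5 * len(motion_vectors)
```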
2.3 Fusion Unit

The fusion unit produces an aggregate aggression indication given the features/events produced independently by the audio and video subsystems. A fundamental difficulty with fusion arises from the inevitable ambiguities in human behavior, which make it difficult to separate normal from aggressive activities (even for a human observer). Additional problems follow from various noise artifacts in the sensory data.

Given the noisy and ambiguous domain, we resort to a probabilistic formulation. The fusion unit employs a probabilistic time-series model (a Dynamic Bayesian Network, DBN [5]), where the aggression level can be estimated in a principled way by solving an appropriate inference problem.

2.3.1 Basic model

We denote the discrete-time index as k = 1, 2, . . ., and set the gap between discrete-time steps (clock ticks) to 50 ms. At the kth step, $y^a_k \in \{0, 1\}$ denotes the output of the audio aggression detector, and $y^v_{j,k}$ denotes the pseudo-kinetic energy of the jth person, j = 1, . . . , J. Our system can comprise several train detectors monitoring non-overlapping rail sections. The binary output of the mth train detector, m = 1, . . . , M, will be denoted as $y^T_{m,k} \in \{0, 1\}$ (we tested a configuration with M = 4).

Our aim is to detect "ambient" scene aggression, without deciding precisely which persons are aggressive. Therefore we reason on the basis of a cumulative articulation measurement $y^v_k = \sum_j y^v_{j,k}$ over all persons. Additionally, the cumulative measurement is quite robust to (near-)occlusions, where the articulation of one person could be wrongly attributed to another.

In order to reason about the aggression level, we use a six-level discrete scale on [0, 1]: 0.0 (no activity), 0.2 (normal activity), 0.4 (attention required), 0.6 (minor disturbance), 0.8 (major disturbance), and 1.0 (critical aggression).

Importantly, the aggression level obeys specific correlations over time and should be represented as a process (rather than an instantaneous quantity). We denote the aggression level at step k as $a_k$ and define a stochastic process $\{a_k\}$ with dynamics given by a 1st-order model:

$p(a_{k+1} = i \mid a_k = j) = \mathrm{CPT}_a(i, j)$,    (2)

where $\mathrm{CPT}_a(i, j)$ denotes a conditional probability table. In a sense, the first-order model is a simplification, as it captures only short-term dependencies.

The measured visual ($y^v_k$) and auditory ($y^a_k$) features are treated as samples from an observation distribution (model) that depends on the aggression level $a_k$. Since (later on) we will incorporate information about passing trains, we introduce a latent train-noise indicator variable $n_k \in \{0, 1\}$ and assume that the observation model

$p(y^v_k, y^a_k \mid a_k, n_k)$

depends also on the train-noise indicator. The model takes the form of a conditional probability table $\mathrm{CPT}_o$, where the cumulative articulation feature is discretized.
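To make equation (2) concrete, a sketch of one prediction step for the six-level aggression process; the transition table below is illustrative, whereas in the paper it is learned from data:

```python
import numpy as np

LEVELS = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])  # discrete aggression scale

# CPT_a as a row-stochastic matrix: row j holds p(a_{k+1} = i | a_k = j).
# These values are illustrative only; the paper learns them from data.
CPT_A = 0.9 * np.eye(6) + 0.05 * (np.eye(6, k=1) + np.eye(6, k=-1))
CPT_A[0, 0] += 0.05  # reflect probability mass at the lower boundary
CPT_A[5, 5] += 0.05  # ... and at the upper boundary

def predict(belief):
    """One-step prediction: p(a_{k+1}) = sum_j p(a_{k+1} | a_k = j) p(a_k = j)."""
    return belief @ CPT_A

belief = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # start at "no activity"
belief = predict(belief)
print("expected aggression level:", float(belief @ LEVELS))
```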
2.3.2 Train models

The fusion DBN comprises several subnetworks — train models — which couple the train detections $y^T_{m,k}$ with the latent train-noise indicator $n_k$. Additionally, each train model encodes prior information about the duration of a train pass.

For the mth rail section, we introduce a latent indicator $i_{m,k} \in \{0, 1\}$ of a train passing at step k. We assume that the train detections $y^T_{m,k}$, the train-pass indicators $i_{m,k}$, and the train noise $n_k$ obey the probabilistic relations

$p(y^T_{m,k} \mid i_{m,k}) = \mathrm{CPT}_t(y^T_{m,k}, i_{m,k})$    (3)

$p(n_k \mid i_{1:M,k}) = \mathrm{CPT}_n(n_k, i_{1:M,k})$.    (4)

For each rail, model (3) encodes the inaccuracies of the detector (mis-detections, false alarms). Model (4) represents the fact that passing trains usually induce noise, but also that sometimes noise is present without a passing train.

Since a typical pass takes 5-10 seconds (100-200 steps), the pass indicator variable exhibits strong temporal correlations. We represent such correlations with a time-series model based on a gamma distribution. A gamma pdf $\gamma(\tau_m, \alpha_m, \beta_m)$ is a convenient choice for modeling the duration $\tau_m$ of an event ($\alpha_m$, $\beta_m$ are parameters). To apply this model in a time-series formulation, we replace the total duration $\tau_m$ with a partial duration $\tau_{m,k}$ that indicates how long a train has already been passing the scene at step k.

By considering a joint process $\{i_{m,k}, \tau_{m,k}\}$, temporal correlations can be enforced by the following model:

$p(i_{m,k+1} = 1 \mid \tau_{m,k}, i_{m,k} = 0) = \eta_m$

$p(i_{m,k+1} = 1 \mid \tau_{m,k}, i_{m,k} = 1) = p(\tau_m > \tau_{m,k}) = \int_{\tau_{m,k}}^{+\infty} \gamma(\tau_m, \alpha_m, \beta_m)\,d\tau_m = 1 - F(\tau_{m,k}, \alpha_m, \beta_m)$,

where $F$ is the gamma cumulative density function. The parameter $\eta_m$ denotes the probability of starting a new train pass. At the kth step, the probability of continuing a pass is a function of the current duration of the pass. A configuration $(i_{m,k+1} = 1, \tau_{m,k}, i_{m,k} = 1)$ implies that the pass has not finished yet and the total pass duration will be larger than $\tau_{m,k}$, hence the integration. Further, the partial duration variable obeys a deterministic regime:

$\tau_{m,k+1} = 0$ if $i_{m,k+1} = 0$, and $\tau_{m,k+1} = \tau_{m,k} + \epsilon$ otherwise,

where $\epsilon = 50$ ms is the period between successive steps.
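A sketch of these duration-dependent dynamics, using SciPy's gamma distribution. The paper does not spell out its parameterization of $\gamma(\tau_m, \alpha_m, \beta_m)$, so shape alpha and scale beta (and the numeric values, chosen to give pass durations of roughly 5-10 s) are assumptions:

```python
import numpy as np
from scipy.stats import gamma

EPS = 0.05  # 50 ms between clock ticks

def step_train_pass(i_k, tau_k, eta=0.005, alpha=8.0, beta=1.0,
                    rng=np.random.default_rng()):
    """One transition of the pass indicator i_{m,k} and partial duration tau_{m,k}.

    p(start) = eta; p(continue) = p(tau_m > tau_k) = 1 - F(tau_k; alpha, beta).
    """
    if i_k == 0:
        i_next = rng.random() < eta
    else:
        survive = 1.0 - gamma.cdf(tau_k, a=alpha, scale=beta)  # 1 - F(tau_k)
        i_next = rng.random() < survive
    tau_next = tau_k + EPS if i_next else 0.0  # deterministic duration update
    return int(i_next), tau_next
```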
2.3.3 Inference and Learning

In a probabilistic framework, reasoning about aggression corresponds to solving probabilistic inference problems. In an online mode, the key quantity of interest is the posterior distribution on the aggression level given the data collected up to the current step, $p(a_k \mid y^v_{1:k}, y^a_{1:k}, y^T_{1:M,1:k})$. From this distribution we calculate the expected aggression value, which will be the basic output of the fusion system.

[Figure 3: Dynamic Bayesian Network representing the probabilistic fusion model. The rectangular plate indicates M = 4 replications of the train sub-network.]

Given the graphical structure of the model (Fig. 3), the required distribution can be efficiently computed using a recursive, forward filtering procedure [5]. We implemented an approximate variant of the filtering procedure, known as the Boyen-Koller algorithm [1]. At a given step k, the algorithm maintains only the marginal distributions $p(h_k \mid y^v_{1:k}, y^a_{1:k}, y^T_{1:M,1:k})$, where $h_k$ is any of the latent variables. When new detector data arrive, the current-step marginals are updated to represent the next-step marginals.

An important modeling aspect is the temporal development of processes in the scene. Unlike the binary train-pass events, the aggression level usually undergoes more subtle evolutions as the tension and anger among people build up. Since the assumed (1st-order) model might not capture long-term effects well, and a stronger model would be rather complicated, we enforce temporal correlations with a simple low-pass filter. The articulation measurements (before inference) and the expected aggression level (after inference) are low-pass filtered using a 10 s running-average filter.

The parameters of our model (the probability tables $\mathrm{CPT}_a$, $\mathrm{CPT}_o$, $\mathrm{CPT}_n$, $\mathrm{CPT}_t$ and the parameters $\alpha_m$, $\beta_m$ of the gamma pdfs) are estimated by maximum-likelihood learning. The learning process relies on detector measurements from training audio-video clips and ground-truth annotations. The annotations are particularly important for learning the observation model $\mathrm{CPT}_o$. An increased probability of false auditory aggression in the presence of train noise will suppress the contribution of the audio data to the aggression level when the video subsystem reports a passing train.
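The full model runs approximate (Boyen-Koller) filtering over all latent variables jointly. As a simplified stand-in, a sketch of exact forward filtering for the aggression level alone, assuming the observation model has already been reduced to a per-step likelihood vector, together with the 10 s running-average smoothing:

```python
import numpy as np

def forward_filter(likelihoods, cpt_a, prior):
    """Exact forward filtering for a_k: predict with CPT_a, correct with evidence.

    likelihoods: (T, 6) array; row k holds p(y_k | a_k = i) for each level i.
    Returns (T, 6) posterior marginals p(a_k | y_{1:k}).
    """
    beliefs = np.empty_like(likelihoods, dtype=float)
    belief = np.asarray(prior, dtype=float)
    for k, lik in enumerate(likelihoods):
        belief = (belief @ cpt_a) * lik  # predict, then weight by evidence
        belief /= belief.sum()           # normalize
        beliefs[k] = belief
    return beliefs

def running_average(x, window=200):
    """10 s smoothing: 200 ticks of 50 ms each."""
    return np.convolve(x, np.ones(window) / window, mode="same")

# Expected aggression level per step, then low-pass filtered as in the paper:
# expected = running_average(forward_filter(lik, CPT_A, prior) @ LEVELS)
```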
3 Experiments

We evaluate the aggression detection system using a set of 13 audio-video clips (scenarios) recorded at a train station. The clips (each 100-150 s) feature 2-4 professional actors who engage in a variety of activities ranging from normal (walking), through slightly excited (shouting, running, hugging) and moderately aggressive (pushing, hitting a vending machine), to critically aggressive (football supporters clashing). The recording took place on a platform of an actual train station (between two rail tracks, partially outdoor) and therefore incorporates realistic artifacts, like noise and vibrations from trains, variable illumination, wind, etc.

Scenarios have been manually annotated with a ground-truth aggression level by two independent observers using the scale mentioned in Sect. 2.3.1. Aggression toward objects was rated approx. 25% lower than aggression toward humans, i.e. the former did not exceed a level of 0.8.

[Figure 4: A screen-shot of the CASSANDRA prototype system. In the lower part of the right window, various intermediate quantities are shown: the probabilities for trains on various tracks (currently zero), the output of the video-based articulation energy ("People", currently mid-way), and the output of the audio detector (currently at maximum). The slider on the right side shows the overall estimated aggression level (currently "major disturbance").]

Fig. 5 details the results of the CASSANDRA system on a scenario involving gradual aggression build-up. Here, two pairs of competing supporters first start arguing, then get into a fight. The bottom panel shows some illustrative frames, with articulation features highlighted. We see from Fig. 5 that the raw (before low-pass filtering) articulation measurements are rather noisy; however, the low-pass filtering reveals a strong correlation with the ground-truth aggression level. The effect of low-pass filtering of the estimated aggression level is shown in the bottom plot of Fig. 5. A screen-shot of the CASSANDRA system in action is shown in Fig. 4.

[Figure 5: Aggression build-up scenario. (Top graph) Audio-aggression detections. (Middle graph) Visual articulation measurements. (Bottom graph) Estimated and ground-truth aggression level. The gray lines show uncertainty intervals (2× std. deviation) around the raw (before filtering) expected level. (Images) Several frames (timestamps: 45 s, 77 s, 91 s, 95 s). Notice the correspondence with the articulation measurements.]

We considered two quantitative criteria to evaluate our system. The first is the deviation of the CASSANDRA-estimated aggression level from the ground-truth annotation. Here we obtained a deviation of mean 0.17 with standard deviation 0.1. The second performance criterion considers aggression detection as a two-class classification problem of distinguishing between "normal" and "aggressive" events (by thresholding the aggression level at 0.5). Matching ground-truth with estimated events allows us to compute the detection rate (%) and false-alarm rate (per hour). When matching events we allowed a time deviation of 10 s.

[Figure 6: Selected frames from a scenario involving aggression toward a machine.]
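A sketch of this matching step under one plausible reading, in which each ground-truth event is greedily matched to at most one detected event within the 10 s tolerance; the paper does not specify the exact matching procedure:

```python
def match_events(truth_times, detected_times, tol=10.0):
    """Greedily match ground-truth events to detections within tol seconds.

    truth_times, detected_times: event times in seconds.
    Returns (true_positives, false_positives).
    """
    unused = sorted(detected_times)
    true_pos = 0
    for t in sorted(truth_times):
        hit = next((d for d in unused if abs(d - t) <= tol), None)
        if hit is not None:
            unused.remove(hit)  # each detection may explain one event
            true_pos += 1
    return true_pos, len(unused)  # leftover detections are false positives
```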
The cumulative results of leave-one-out tests on the 13 scenarios (12 for training, 1 for testing) are given in Fig. 7. Comparing the test results for the three modalities (audio, video, fusion of audio+video), we notice that the auditory and visual features are indeed complementary; with fusion, the overall detection rate increased without introducing additional false alarms.

[Figure 7: Cumulative detection results by sensor modality.]

It is important to note that our data set is heavily biased toward occurrences of aggression, which puts the system to a difficult test. We expect CASSANDRA to produce far fewer false alarms in a typical surveillance setting, where most of the time nothing happens.

Table 1 gives an overview of the detection results on the scenarios. We notice that the system performed well on the clearly normal cases (scenarios 1-3) and the clearly aggressive cases (scenarios 9-13), while borderline scenarios were more difficult to classify. The borderline behavior (e.g. scenarios 7-8) turns out to be difficult to classify for human observers as well, given the inconsistent ground-truth annotation in Tab. 1.

Table 1: Aggression detection results by scenario. The table indicates the number of events (positive = aggressive). Figure 6 shows example frames from the 5th scenario.

 id  scenario content                      ground-truth positive  true-pos.  false-pos.
  1  normal: walking, greeting                      0                 0          0
  2  normal: walking, greeting                      0                 0          1
  3  excited: lively argument                       0                 0          0
  4  excited: lively argument                       1                 1          0
  5  aggression toward a vend. machine              1                 0          1
  6  aggression toward a vend. machine              1                 0          0
  7  happy football supporters                      1                 1          0
  8  happy football supporters                      0                 0          1
  9  supporters harassing a passenger               1                 1          0
 10  supporters harassing a passenger               1                 1          0
 11  two people fight, third intervenes             1                 1          0
 12  four people fighting                           1                 1          0
 13  four people fighting                           1                 1          1

The CASSANDRA system runs on two PCs (one with the video and fusion units, the other with the audio unit). The overall processing rate is approx. 5 Hz with a 756×560 pixel, 20 Hz input video stream and a 44 kHz input audio stream.

4 Conclusions and Future Work

We demonstrated a prototype system that uses a Dynamic Bayesian Network to fuse auditory and visual information for detecting aggressive behavior. On the auditory side, the system relies on scream-like cues; on the video side, the system uses motion features related to articulation. We obtained a promising aggression detection performance in a complex, real-world train station setting, operating in near-realtime.

The present system is able to distinguish well between clear cases of aggression and normal behavior. In the future, we plan to focus increasingly on the "borderline" cases. For this, we expect to use more elaborate auditory cues (laughter vs. scream), more detailed visual cues (indications of body contact, partial body-pose estimation), and stronger use of context information.

References

[1] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI, 1998.
[2] A. Datta et al. Person-on-person violence detection in video data. In ICPR, 2002.
[3] H. Duifhuis et al. Cochlear Mechanisms: Structure, Function and Models, pages 395-404. 1985.
[4] D.M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82-98, 1999.
[5] Z. Ghahramani. An introduction to hidden Markov models and Bayesian networks. IJPRAI, 15(1):9-42, 2001.
[6] J.-C. Junqua. The Lombard reflex and its role on human listeners and automatic speech recognizers. Journal of the Acoustical Society of America, 93(1):510-524, 1993.
[7] R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoustics, Speech, and Signal Processing, 34(4):744-754, 1986.
[8] A. Pentland. Smart rooms. Scientific American, 274(4):54-62, 1996.
[9] K. Scherer. Vocal affect expression: A review and a model for future research. Psychol. Bulletin, 99(2):143-165, 1986.
[10] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[11] D. Wang and G. Brown. Computational Auditory Scene Analysis. John Wiley and Sons, 2006.
[12] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In ICPR, 2004.
[13] Z. Zivkovic and B. Krose. An EM-like algorithm for color-histogram-based object tracking. In CVPR, 2004.