Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation
Using a New Frame Selection Policy and Gating Mechanism
Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy, Dec. 2022
Introduction
• The recognition of high-level events in unconstrained video is an important topic with applications in security (e.g., “making a bomb”), the automotive industry (e.g., “pedestrian crossing the street”), etc.
• Most approaches are top-down: they “patchify” the frame (context-agnostic) and use the label and loss function to learn to focus on the frame regions related to the event
• Bottom-up approaches: use an object detector, a feature extractor and a graph network to extract and process features from the main objects in the video
[Figure: example video frames for the event “walking the dog”]
ViGAT
• Our recent bottom-up approach, with SOTA performance on many datasets
• Uses a graph attention network (GAT) head to process local (object) and global (frame) information
• Also provides frame- and object-level explanations (in contrast to top-down approaches)
[Figure: video event “removing ice from car” miscategorized as “shoveling snow”; the object-level explanation shows that the classifier does not focus on the car object]
ViGAT block
• Cornerstone of the ViGAT head; transforms a feature matrix (representing the graph’s nodes) into a feature vector (representing the whole graph)
• Computes the explanation significance (weighted in-degrees, WiDs) of each node using the graph’s adjacency matrix
[Figure: ViGAT block. The attention mechanism computes the attention matrix from the node features X (K x F), and the adjacency matrix A (K x K) from the attention coefficients; the node features are multiplied with the adjacency matrix to give Z (K x F), and graph pooling produces the vector representation η (1 x F) of the whole graph.]
The WiDs, i.e., the explanation significance of the l-th node, are computed as
$$\varphi_l = \sum_{k=1}^{K} a_{k,l}, \qquad l = 1, \dots, K$$
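Since the WiD of node l is just the l-th column sum of the adjacency matrix, the computation is a one-liner; a minimal NumPy sketch (illustrative names, not the authors’ code):

```python
import numpy as np

def weighted_in_degrees(A: np.ndarray) -> np.ndarray:
    """WiDs: explanation significance of each node.

    A is the K x K adjacency matrix of attention coefficients, where
    A[k, l] is the attention of node k towards node l. The WiD of node l
    is the sum of the l-th column, phi_l = sum_k a_{k,l}.
    """
    return A.sum(axis=0)  # shape (K,)

# Example: 3 graph nodes (e.g., 3 objects in a frame)
A = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.7, 0.2],
              [0.3, 0.4, 0.3]])
phi = weighted_in_degrees(A)  # -> [0.6, 1.6, 0.8]; node 2 matters most
```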
ViGAT architecture
[Figure: ViGAT architecture. The object detector o extracts K objects from each of the P video frames, and the feature extractor b produces K object-level features per frame as well as P frame-level global features. Local branch (ω2, ω3): one ω2 block per frame aggregates its K object-level features into a frame-level local feature, and ω3 aggregates the P frame-level local features into the video-level local feature. Global branch (ω1): ω1 aggregates the P frame-level global features into the video-level global feature. The two video-level features are concatenated into the video feature and fed to the classification head u. Recognized event: “Playing beach volleyball!”; the frame WiDs (local and global information) and object WiDs provide the explanation, i.e., the event-supporting frames and objects.
Legend — o: object detector; b: feature extractor; u: classification head; GAT blocks: ω1, ω2, ω3; global branch: ω1; local branch: ω2, ω3.]
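To make the data flow concrete, here is a schematic NumPy sketch of the forward pass, with a stand-in attention-pooling function in place of the real GAT blocks and illustrative shapes (this is not the authors’ implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, F, C = 30, 50, 768, 200     # frames, objects/frame, feature dim, classes

def gat_block(X):
    """Stand-in for a GAT block (omega): pools a (nodes x F) matrix into (F,).
    The real blocks build attention-based adjacency matrices (and yield WiDs);
    simple self-attention pooling is used here just to illustrate the interface."""
    A = X @ X.T                                   # pairwise attention scores
    E = np.exp(A - A.max(axis=1, keepdims=True))  # row-wise softmax
    A = E / E.sum(axis=1, keepdims=True)
    return (A @ X).mean(axis=0)                   # graph representation, (F,)

obj_feats  = rng.normal(size=(P, K, F))  # b() on the K objects of each frame (from o)
frame_glob = rng.normal(size=(P, F))     # b() on each whole frame

# Local branch: one omega2 per frame (over objects), omega3 over frames
frame_local = np.stack([gat_block(obj_feats[p]) for p in range(P)])
video_local = gat_block(frame_local)

# Global branch: omega1 over the frame-level global features
video_global = gat_block(frame_glob)

video_feat = np.concatenate([video_local, video_global])  # input to head u
W_u = rng.normal(size=(C, 2 * F)) * 0.01                  # toy linear head u
event_scores = W_u @ video_feat
```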
ViGAT
• ViGAT has a high computational cost due to the local (object) information processing (e.g., P = 120 frames, K = 50 objects per frame, i.e., PK = 6000 objects per video)
• Efficient video processing has been investigated in the top-down (frame) paradigm:
- Frame selection policy: identify the most important frames for classification
- Gating component: stop processing frames when sufficient evidence is achieved
• This is an unexplored topic in the bottom-up paradigm: can such techniques reduce the computational complexity of ViGAT’s local processing pipeline?
Gated-ViGAT
[Figure: Gated-ViGAT local information processing pipeline. The frame selection policy picks Q(s) of the P extracted video frames, using the already computed video-level global feature and frame WiDs (global info). The ω2 blocks process the K objects of each selected frame and ω3 produces the video-level local feature ζ(s) along with the per-frame features Z(s). Gate g(s) (ON/OFF) decides on Z(s): if the gate is closed, Q(s+1) − Q(s) additional frames are requested and the pipeline repeats with gate g(s+1); if it is open, ζ(s) is concatenated with the video-level global feature into the video feature and fed to the classification head u. Recognized event: “Playing beach volleyball!”; frame WiDs (local info) and object WiDs provide the explanation, i.e., the event-supporting frames and objects.]
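The diagram amounts to an early-exit loop; a sketch with hypothetical names (select_frames, local_branch, gates and classify are placeholders, not the repo’s API):

```python
def gated_inference(video, budgets, select_frames, local_branch, gates, classify):
    """Early-exit sketch of Gated-ViGAT inference (illustrative names).

    budgets       -- increasing frame counts Q(1)..Q(S), e.g. [9, 12, 16, 20, 25, 30]
    select_frames -- frame selection policy: (video, q) -> q frame indices
    local_branch  -- (video, indices) -> (zeta, Z): video-level local feature
                     and the per-frame features given to the gate
    gates[s]      -- binary gate g(s) on Z: True = sufficient evidence, exit
    classify      -- classification head u on the video feature (in the full
                     model, zeta concatenated with the video-level global feature)
    """
    for s, q in enumerate(budgets):
        indices = select_frames(video, q)       # policy picks q diverse frames
        zeta, Z = local_branch(video, indices)  # bottom-up (object) processing
        if gates[s](Z) or s == len(budgets) - 1:
            return classify(zeta)               # gate open (or last gate): exit
        # gate closed: next iteration requests Q(s+1) - Q(s) additional frames
```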
Gated-ViGAT: Frame selection policy
• Iterative algorithm to select Q frames
Input: Q, the initial frame index p1 (the argmax of the global frame WiDs), and the P frame-level global feature vectors u1, …, uP (min-max normalized)
1. Initialize: normalize each feature vector, γp = up / |up|, p = 1, …, P
2. Select the remaining Q − 1 frames: for i = 2, …, Q, compute each frame’s dissimilarity to the previously selected frame p_{i−1},
$$\alpha_p = \tfrac{1}{2}\left(1 - \gamma_p^{\mathsf{T}} \gamma_{p_{i-1}}\right),$$
reweight the features, up ← αp up, and select the next frame pi by argmax over the reweighted frames
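A NumPy sketch of one plausible reading of this policy (the exact selection criterion in the paper may differ; all names here are illustrative):

```python
import numpy as np

def select_frames(U: np.ndarray, wids: np.ndarray, Q: int) -> list:
    """Frame selection policy sketch.

    U    -- P x F matrix of frame-level global features u_1..u_P
    wids -- P global WiDs (explanation significance of each frame)
    Q    -- number of frames to select
    """
    gamma = U / np.linalg.norm(U, axis=1, keepdims=True)   # gamma_p = u_p/|u_p|
    # min-max normalized WiDs serve as the initial selection scores
    scores = (wids - wids.min()) / (wids.max() - wids.min() + 1e-12)
    selected = [int(np.argmax(scores))]        # p_1: most significant frame
    for _ in range(Q - 1):                     # pick the remaining Q-1 frames
        last = selected[-1]
        # alpha_p in [0, 1]: small when frame p is similar to the last pick
        alpha = 0.5 * (1.0 - gamma @ gamma[last])
        scores = scores * alpha                # down-weight similar frames
        scores[selected] = -np.inf             # never re-select a frame
        selected.append(int(np.argmax(scores)))
    return selected
```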
Gated-ViGAT: Gate training
• Each gate has a GAT block-like structure and a binary classification head (open/close); each gate corresponds to a specified number of frames Q(s) and is trained to output 1 (i.e., open) when the ViGAT loss is low; design hyperparameters: Q(s), β (sensitivity)
• Gate training steps:
1. Use the frame selection policy to select Q(s) frames for gate g(s)
2. Compute the video-level local feature ζ(s) (and Z(s))
3. Compute the ViGAT classification loss: l_ce = CE(label, y)
4. Derive the pseudolabel o(s): 1 if l_ce ≤ βe^{s/2}, zero otherwise
5. Compute the gate component loss:
$$L = \frac{1}{S} \sum_{s=1}^{S} l_{bce}\big(g^{(s)}(\mathbf{Z}^{(s)}), o^{(s)}\big)$$
6. Perform backpropagation to update the gate weights
[Figure: gate training setup. The Q(s) frames selected for gate g(s) pass through the local ViGAT branch to produce ζ(s) and Z(s); ζ(s) is concatenated with the computed video-level global feature into the video feature and fed to the classification head u, whose output y is compared with the ground-truth label via cross entropy; the gate output g(s)(Z(s)) is trained against the pseudolabel o(s) via binary cross entropy.]
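A NumPy sketch of the pseudolabel and loss computation for one video, assuming the per-gate ViGAT losses and gate logits are already available (illustrative names, not the repo code):

```python
import numpy as np

def gate_loss(lce: np.ndarray, gate_logits: np.ndarray, beta: float = 1e-8):
    """Gate training objective sketch.

    lce[s]         -- ViGAT cross-entropy loss at gate s (using Q(s) frames)
    gate_logits[s] -- scalar output of gate g(s) on Z(s)
    beta           -- sensitivity hyperparameter
    """
    S = len(lce)
    s_idx = np.arange(1, S + 1)
    # pseudolabel o(s) = 1 (open) when the ViGAT loss is low enough;
    # the threshold beta * e^{s/2} relaxes for later gates
    o = (lce <= beta * np.exp(s_idx / 2.0)).astype(float)
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid of the gate outputs
    bce = -(o * np.log(g + 1e-12) + (1 - o) * np.log(1 - g + 1e-12))
    return bce.mean()                        # L = (1/S) * sum_s l_bce(g(s), o(s))
```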
Experiments
• ActivityNet v1.3: 200 events/actions, 10K/5K training/testing videos, 5 to 10 min long; multilabel
• MiniKinetics: 200 events/actions, 80K/5K training/testing videos, 10 s duration; single-label
• Video representation: 120/30 frames with uniform sampling for ActivityNet/MiniKinetics
• Pretrained ViGAT components: Faster R-CNN (pretrained/finetuned on ImageNet1K/Visual Genome, K = 50 objects), ViT-B/16 backbone (pretrained/finetuned on ImageNet11K/ImageNet1K), 3 GAT blocks (pretrained on the respective dataset, i.e., ActivityNet or MiniKinetics)
• Gates: S = 6/5 (number of gates), {Q(s)} = {9, 12, 16, 20, 25, 30} / {2, 4, 6, 8, 10} (frame budgets), for ActivityNet/MiniKinetics
• Gate training hyperparameters: β = 10⁻⁸, 40 epochs, lr = 10⁻⁴, multiplied by 0.1 at epochs 16 and 35
• Evaluation measures: mAP (ActivityNet), top-1 accuracy (MiniKinetics), FLOPs
• Gated-ViGAT is compared against the top-scoring methods on the two datasets
Experiments: results
Methods in MiniKinetics Top-1%
TBN [30] 69.5
BAT [7] 70.6
MARS (3D ResNet) [31] 72.8
Fast-S3D (Inception) [14] 78.0
ATFR (X3D-S) [18] 78.0
ATFR (R(2+1)D) [18] 78.2
RMS (SlowOnly) [28] 78.6
ATFR (I3D) [18] 78.8
Ada3D (I3D, Kinetics) [32] 79.2
ATFR (3D ResNet) [18] 79.3
CGNL (Modified ResNet) [17] 79.5
TCPNet (ResNet, Kinetics) [3] 80.7
LgNet (R3D) [3] 80.9
FrameExit (EfficientNet) [1] 75.3
ViGAT [9] 82.1
Gated-ViGAT (proposed) 81.3
• Gated-ViGAT outperforms all top-down approaches
• It slightly underperforms ViGAT, but with an approx. 4× (MiniKinetics) and 5.5× (ActivityNet) FLOPs reduction
• As expected, it has a higher computational complexity than many top-down approaches (e.g., see [3], [4]) but can provide explanations
Methods in ActivityNet mAP%
AdaFrame [21] 71.5
ListenToLook [23] 72.3
LiteEval [33] 72.7
SCSampler [25] 72.9
AR-Net [13] 73.8
FrameExit [1] 77.3
AR-Net (EfficientNet) [13] 79.7
MARL (ResNet, Kinetics) [22] 82.9
FrameExit (X3D-S) [1] 87.4
ViGAT [9] 88.1
Gated-ViGAT (proposed) 87.3
FLOPs ViGAT Gated-ViGAT
ActivityNet 137.4 24.8
MiniKinetics 34.4 8.7
*Best and second best performance
are denoted with bold and underline
Experiments: method insight
• Computed the number of videos exiting and the recognition performance at each gate
• Average number of frames used for ActivityNet / MiniKinetics: 20 / 7 (weighted averages of the tables below; see the computation after them)
• The recognition rate drops as the gate number increases; this behavior shows more clearly on ActivityNet (longer videos)
• Conclusion: “easy” videos exit early, while “difficult” videos remain difficult to recognize even with many frames (a similar conclusion to [1])
ActivityNet g(1) g(2) g(3) g(4) g(5) g(6)
# frames 9 12 16 20 25 30
# videos 793 651 722 502 535 1722
mAP% 99.8 94.5 93.8 92.7 86.0 71.6

MiniKinetics g(1) g(2) g(3) g(4) g(5)
# frames 2 4 6 8 10
# videos 179 686 1199 458 2477
Top-1% 84.9 83.0 81.1 84.9 80.7
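The reported averages follow directly from the tables above, as the exit-frequency-weighted mean of the per-gate frame budgets:

```python
frames = [9, 12, 16, 20, 25, 30]           # ActivityNet gate budgets Q(s)
videos = [793, 651, 722, 502, 535, 1722]   # videos exiting at each gate
avg_an = sum(f * v for f, v in zip(frames, videos)) / sum(videos)   # ~20.6

frames_mk = [2, 4, 6, 8, 10]               # MiniKinetics gate budgets Q(s)
videos_mk = [179, 686, 1199, 458, 2477]
avg_mk = sum(f * v for f, v in zip(frames_mk, videos_mk)) / sum(videos_mk)  # ~7.7
```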
Experiments: examples
• Bullfighting (top) and Cricket (bottom) test videos of ActivityNet exited at the first gate, i.e., they were recognized using only 9 frames, versus the 120 required by ViGAT
• The frames selected with the proposed policy both explain the recognition result and provide a diverse view of the video, helping to recognize it with fewer frames
[Figure: the selected frames of the Bullfighting (top) and Cricket (bottom) videos]
Experiments: examples
• Gated-ViGAT can also provide explanations at the object level (in contrast to top-down methods)
[Figure: object-level explanations for “Waterskiing” predicted as “Making a sandwich”, “Playing accordion” predicted as “Playing guitarra”, and “Breakdancing” (correct prediction)]
Experiments: ablation study on frame selection policies
• Comparison (mAP%) of frame selection policies on ActivityNet, for Θ = 10, 20 or 30 selected frames:

Policy / #frames 10 20 30
Random 83.0 85.5 86.5
WiD-based 84.9 86.1 86.9
Random on local 85.4 86.6 86.9
WiD-based on local 86.6 87.1 87.5
FrameExit policy 86.2 87.3 87.5
Proposed policy 86.7 87.3 87.6
Gated-ViGAT (proposed) 86.8 87.5 87.7

Random: Θ frames selected randomly for both the local and the global features
WiD-based: Θ frames selected using the global WiDs
Random on local: P frames derive the global feature; Θ frames selected randomly for the local one
WiD-based on local: P frames derive the global feature; Θ frames selected using the global WiDs
FrameExit policy: Θ frames selected using the policy of [1]
Proposed policy: P frames derive the global feature; Θ frames selected using the proposed policy
Gated-ViGAT: as in the proposed policy, but the gate component additionally selects Θ frames on average

• Gated-ViGAT selects diverse frames with high explanation potential
• The proposed policy is second best, surpassing the FrameExit policy [1] (current SOTA)
Experiments: ablation study example
• Top-6 frames of a “bungee jumping” video, selected with the WiD-based vs. the proposed policy
[Figure: frames selected by the WiD-based policy and by the proposed policy, shown together with the updated WiDs]
Conclusions
• An efficient bottom-up event recognition and explanation approach was presented
• It utilizes a new policy algorithm to select frames that: a) best explain the classifier’s decision, and b) provide diverse information about the underlying event
• It utilizes a gating mechanism that instructs the model to stop extracting bottom-up (object) information when sufficient evidence of the event has been gathered
• Evaluation on 2 datasets showed competitive recognition performance and an approx. 5× FLOPs reduction in comparison to the previous SOTA
• Future work: investigations for further efficiency improvements, e.g., a faster object detector and feature extractor, frame selection for the global information pipeline as well, etc.
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/Gated-ViGAT
This work was supported by the EU’s Horizon 2020 research and innovation programme under grant agreement 101021866 CRiTERIA
  • 18. 18 Thank you for your attention! Questions? Nikolaos Gkalelis, gkalelis@iti.gr Vasileios Mezaris, bmezaris@iti.gr Code publicly available at: https://github.com/bmezaris/Gated-ViGAT This work was supported by the EUs Horizon 2020 research and innovation programme under grant agreement 101021866 CRiTERIA