Presentation slides for our paper "Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection", ACM ICMR 2021. https://doi.org/10.1145/3460426.3463630.
We developed a new method for unsupervised video thumbnail selection. The proposed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator whose goal is to distinguish the original video from a version reconstructed from a small set of candidate thumbnails. The discriminator's feedback serves as a measure of the representativeness of the selected thumbnails. This measure is combined with estimates of the thumbnails' aesthetic quality (made using a state-of-the-art (SoA) Fully Convolutional Network) to form a reward that trains the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and YouTube) show the competitiveness of the proposed method against other SoA approaches. An ablation study on the adopted thumbnail selection criteria documents the importance of considering aesthetics, and the contribution of this information when combined with measures of the representativeness of the visual content.
1. Combining Adversarial and Reinforcement Learning for
Video Thumbnail Selection
E. Apostolidis1,2, E. Adamantidou1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
2021 ACM International Conference
on Multimedia Retrieval
3. Problem statement
Video is everywhere!
• Captured by smart devices and instantly
shared online
• Constantly and rapidly increasing volumes
of video content on the Web
Hours of video content uploaded on
YouTube every minute
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-
like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
4. Problem statement
But how to spot what we are looking for in endless collections of video content?
Get a quick idea about a
video’s content by
checking its thumbnail!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
5. Goal of video thumbnail selection technologies
Analysis outcomes: a set of
representative video frames
“Select one or a few video frames
that provide a representative and
aesthetically-pleasing overview of
the video content”
Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009”
Video source: OVP dataset (video also available online at: https://www.youtube.com/watch?v=deRF9oEbRso)
6. Related work
Early visual-based approaches: Use of hand-crafted rules about the optimal thumbnail, and
tailored features and mechanisms to assess video frames’ alignment with these rules
• Thumbnail selection associated with: appearance and positioning of faces/objects, color diversity,
variance of luminance, scene steadiness, thematic relevance, absence of subtitles
• Main shortcoming: rule definition and feature engineering are highly complex tasks
Recent visual-based approaches: Target a few commonly-desired characteristics for a video
thumbnail, and exploit learning efficiency of deep network architectures
• Thumbnail selection associated with: learnable estimates about frames’ representativeness and
aesthetic quality (focusing also on faces), learnable classifications of good and bad frames
Recent multimodal approaches: Exploit data from additional modalities or auxiliary sources
• Video thumbnail selection is associated with: extracted keywords from the video metadata, databases
with visually-similar content, latent representations of textual and audio data, textual user queries
8. Developed approach
Network architecture
• Thumbnail Selector
• Estimating frames’ aesthetic quality
• Estimating frames’ importance
• Fusing estimations and selecting a small set of
candidate thumbnails
• Thumbnail Evaluator
• Evaluating thumbnails’ aesthetic quality
• Evaluating thumbnails’ representativeness
• Fusing evaluations (rewards)
• Thumbnail Evaluator → Thumbnail Selector
• Using the overall reward for reinforcement learning
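The fusion-and-selection step of the Thumbnail Selector can be sketched in plain Python. This is a minimal, hypothetical illustration (the fusion weight `alpha` and the top-k picking rule are assumptions for illustration, not the paper's exact implementation):

```python
def fuse_scores(aesthetics, importance, alpha=0.5):
    """Convex combination of per-frame aesthetics and importance scores.
    alpha is a hypothetical fusion weight."""
    return [alpha * a + (1 - alpha) * i for a, i in zip(aesthetics, importance)]

def pick_candidates(fused_scores, k=3):
    """Return the indices of the k highest-scoring frames,
    i.e. the small set of candidate thumbnails."""
    ranked = sorted(range(len(fused_scores)),
                    key=lambda i: fused_scores[i], reverse=True)
    return ranked[:k]
```

For example, fusing aesthetics scores [0.9, 0.1, 0.5] with importance scores [0.2, 0.8, 0.9] at alpha=0.5 gives roughly [0.55, 0.45, 0.7], so frames 2 and 0 would be the top-2 candidates.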
9. Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 1: Update Encoder based on:
LRecon: “distance between the original and reconstructed
feature vectors, based on a latent representation in
the last hidden layer of the Discriminator”
LPrior: “information loss when using the Encoder’s latent
space to represent the prior distribution defined by
the Variational Auto-Encoder”
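The two losses of this step can be written down concretely. Below is a minimal plain-Python sketch (hypothetical helper names; in the actual model these operate on sequences of deep feature vectors): LRecon as a squared Euclidean distance between the Discriminator's last-hidden-layer representations, and LPrior as the standard VAE KL divergence to a unit Gaussian prior.

```python
import math

def recon_loss(phi_original, phi_reconstructed):
    """L_Recon: squared Euclidean distance between the Discriminator's
    last-hidden-layer representations of the original and the
    thumbnail-based reconstructed video."""
    return sum((a - b) ** 2 for a, b in zip(phi_original, phi_reconstructed))

def prior_loss(mu, log_var):
    """L_Prior: KL divergence between the Encoder's latent Gaussian
    N(mu, sigma^2) and the N(0, I) prior of the Variational Auto-Encoder."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

Both losses are zero when the reconstruction matches the original representation and the latent distribution matches the prior.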
10. Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 2: Update Decoder based on:
LRecon: “distance between the original and reconstructed
feature vectors, based on a latent representation in the
last hidden layer of the Discriminator”
LGEN: “difference between the Discriminator’s output when
seeing the thumbnail-based reconstructed feature vectors
and the label (“1”) associated with the original video”
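With a sigmoid Discriminator output in (0, 1], LGEN reduces to a binary cross-entropy against the "original" label. A minimal, hypothetical plain-Python sketch:

```python
import math

def gen_loss(d_out_reconstructed):
    """L_GEN: binary cross-entropy between the Discriminator's output for
    the thumbnail-based reconstruction and the label '1' of the original
    video. Fooling the Discriminator (output near 1) drives this toward 0."""
    return -math.log(d_out_reconstructed)
```

The Decoder would then be updated to minimize LRecon + LGEN, i.e. to produce reconstructions that are both close to the original features and convincing to the Discriminator.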
11. Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 3: Update Discriminator based on:
LORIG: “difference between the Discriminator’s output when
seeing the original feature vectors and the label (“1”)
associated with the original video”
LSUM: “difference between the Discriminator’s output when
seeing the thumbnail-based reconstructed feature
vectors and the label (“0”) associated with the thumbnail-
based video summary”
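The Discriminator's update is the standard GAN real/fake objective. A minimal plain-Python sketch (hypothetical names), assuming sigmoid outputs:

```python
import math

def discriminator_loss(d_out_original, d_out_reconstructed):
    """L_ORIG + L_SUM: binary cross-entropy against label '1' for the
    original feature vectors and label '0' for the thumbnail-based
    reconstructed feature vectors."""
    l_orig = -math.log(d_out_original)            # L_ORIG: original should score 1
    l_sum = -math.log(1.0 - d_out_reconstructed)  # L_SUM: reconstruction should score 0
    return l_orig + l_sum
```

The loss vanishes when the Discriminator perfectly separates the original from the thumbnail-based reconstruction, and grows as the two become indistinguishable.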
12. Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 4: Update Importance Estimator based
on the Episodic REINFORCE algorithm
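The Episodic REINFORCE update scales the log-probabilities of the frame-selection actions by the episode's reward. A minimal sketch in plain Python (the baseline term and the helper names are illustrative assumptions, not the paper's exact formulation):

```python
def reinforce_loss(action_log_probs, reward, baseline=0.0):
    """Episodic REINFORCE: policy-gradient loss for one episode.
    Minimizing it raises the probability of the taken frame-selection
    actions in proportion to how much the reward exceeds the baseline."""
    return -(reward - baseline) * sum(action_log_probs)
```

Here the reward is the fused aesthetics/representativeness signal produced by the Thumbnail Evaluator; a baseline (e.g. a running average of past rewards) is commonly subtracted to reduce gradient variance.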
13. Experiments
Datasets
• Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
• 50 videos of various genres (e.g. documentary, educational, historical, lecture)
• Video length: 46 sec. to 3.5 min.
• Annotation: keyframe-based video summaries (5 per video)
• YouTube (https://sites.google.com/site/vsummsite/download)
• 50 videos of diverse content (e.g. news, TV-shows, sports, commercials) collected from the Web
• Video length: 9 sec. to 11 min.
• Annotation: keyframe-based video summaries (5 per video)
14. Experiments
Evaluation approach
• Ground-truth thumbnails: the top-3 keyframes selected by the human annotators
• Evaluation measures:
• Precision at 1 (P@1): matching the ground-truth with the top-1 machine-selected thumbnail
• Precision at 3 (P@3): matching the ground-truth with the top-3 machine-selected thumbnails
• Performance is also measured using only the top-1 keyframe selected by the human
annotators as the ground-truth
• Run experiments on 10 different randomly-created splits of the used data (80% training;
20% testing) and report the average performance over these runs
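A P@k computation can be sketched as follows (hypothetical sketch: here a "match" is exact keyframe identity, whereas the actual evaluation protocol may match frames by visual similarity):

```python
def precision_at_k(selected, ground_truth, k):
    """P@k: fraction of the top-k machine-selected thumbnails that match
    one of the ground-truth (human-selected) keyframes."""
    top_k = selected[:k]
    hits = sum(1 for frame in top_k if frame in ground_truth)
    return hits / k
```

P@1 checks only the single best machine selection against the ground-truth set, while P@3 credits any of the top-3 selections.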
15. Experiments
Performance comparisons using top-3 human-selected keyframes as ground-truth

                               OVP               YouTube
                            P@1     P@3       P@1     P@3
Baseline (random)          15.79%  32.51%     7.53%  17.94%
Mahasseni et al. (2017)      -      7.80%      -     11.34%
Song et al. (2016)           -     11.72%      -     16.47%
Gu et al. (2018)             -     12.18%      -     18.25%
Apostolidis et al. (2021)  15.00%  24.00%     8.75%  15.00%
Proposed approach          31.00%  40.00%    15.00%  20.00%
B. Mahasseni et al., "Unsupervised Video Summarization with Adversarial LSTM Networks," Proc. CVPR 2017, pp. 2982–2991.
Y. Song et al., "To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos," Proc. CIKM 2016, pp. 659–668.
H. Gu et al., "From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All," Proc. ICME 2018, pp. 1–6.
E. Apostolidis et al., "AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization," IEEE Trans. CSVT, vol. 31, no. 8, pp. 3278–3292, Aug. 2021.
16. Experiments
Performance comparisons using top-1 human-selected keyframes as ground-truth

                               OVP               YouTube
                            P@1     P@3       P@1     P@3
Baseline (random)           6.36%  16.66%     4.23%   9.98%
Apostolidis et al. (2021)   7.00%  14.00%     6.25%   8.75%
Proposed approach          17.00%  21.00%    10.00%  16.25%
17. Experiments
Ablation study (all values in %; √ = criterion used, X = criterion not used)

Thumbnail selection criteria: aesthetics estimations (used for frame picking and/or in the reward) and representativeness estimations (used in the reward).

                    Aesthetics est.    Repr. est.   OVP (top-3 GT)  OVP (top-1 GT)  YouTube (top-3 GT)  YouTube (top-1 GT)
                    Picking   Reward   Reward       P@1    P@3      P@1    P@3      P@1    P@3          P@1    P@3
Baseline (random)      -        -        -         15.79  32.51     6.36  16.66     7.53  17.94         4.23   9.98
Variant #1             √        √        X         16.00  20.00     8.00  12.00     6.00  17.50         5.00   7.50
Variant #2             X        X        √         20.00  30.00     8.00  13.00    10.00  18.75         3.75   8.75
Variant #3             √        X        √         12.00  36.00     3.00  18.00    10.00  18.75         6.25  12.50
Variant #4             X        √        √         30.00  39.00    18.00  23.00    13.75  16.25        10.00  12.50
Proposed approach      √        √        √         31.00  40.00    17.00  21.00    15.00  20.00        10.00  16.25
18. Conclusions
• Deep network architecture for video thumbnail selection, trained by combining adversarial
and reinforcement learning
• Thumbnail selection relies on the representativeness and aesthetic quality of video frames
• Representativeness is measured by an adversarially-trained discriminator
• Aesthetic quality is estimated by a pretrained Fully Convolutional Network
• An overall reward is used to train the Thumbnail Selector via reinforcement learning
• Experiments on two benchmark datasets (OVP and YouTube):
• Showed the improved performance of our method compared to other SoA video thumbnail
selection and summarization approaches
• Documented the importance of aesthetics for the video thumbnail selection task
19. Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/Video-Thumbnail-Selector
This work was supported by the EU's Horizon 2020 research and innovation programme under
grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1