TRECVID 2016
Video to Text Description
NEW Showcase / Pilot Task(s)
Alan Smeaton
Dublin City University
Marc Ritter
Technical University Chemnitz
George Awad
NIST; Dakota Consulting, Inc
1TRECVID 2016
Goals and Motivations
✓ Measure how well an automatic system can describe a video in natural language.
✓ Measure how well an automatic system can match high-level textual descriptions to low-level computer vision features.
✓ Transfer successful image captioning technology to the video domain.
Real world Applications
✓ Video summarization
✓ Supporting search and browsing
✓ Accessibility - video description for the blind
✓ Video event prediction
2TRECVID 2016
• Given a set of:
Ø 2000 URLs of Twitter Vine videos.
Ø 2 sets (A and B) of text descriptions for each of the 2000 videos.
• Systems are asked to submit results for two subtasks:
1.  Matching & Ranking:
Return, for each URL, a ranked list of the most likely text descriptions from
each of set A and set B.
2.  Description Generation:
Automatically generate a text description for each URL.
3
TASK
TRECVID 2016
Video Dataset
•  Crawled 30k+ Twitter Vine video URLs.
•  Maximum video duration: 6 seconds.
•  A subset of 2000 URLs randomly selected.
•  Marc Ritter’s TUC Chemnitz group supported manual
annotations:
•  Each video annotated by 2 persons (A and B).
•  In total 4000 textual descriptions (1 sentence each) were produced.
•  Annotation guidelines by NIST:
•  For each video, annotators were asked to combine 4 facets where applicable:
• Who is the video describing (objects, persons, animals, etc.)?
• What are the objects and beings doing (actions, states, events, etc.)?
• Where (locale, site, place, geographic location, etc.)?
• When (time of day, season, etc.)?
4TRECVID 2016
Annotation Process Obstacles
§  Poor video quality
§  Many simple scenes/events with repetitive, plain descriptions
§  Many complex scenes containing too many events to describe
§  Clips sometimes too short for a meaningful description
§  Audio track often relevant to the description but deliberately not used, to avoid semantic distractions
§  Non-English text overlays/subtitles hard to understand
§  Cultural differences in the reception of events/scene content
§  Finding a neutral scene description is a challenging task
§  Well-known people in videos may have (inappropriately) influenced the description of scenes
§  Specifying the time of day frequently impossible for indoor shots
§  Description quality suffers over long annotation hours
§  Some Vines were offline (no longer available)
§  Many Vines with redundant or even identical content
TRECVID 2016 5
Annotation UI Overview
TRECVID 2016 6
Annotation Process
TRECVID 2016 7
Annotation Statistics
UID   # annotations   Ø per annotation (sec)   max (sec)   min (sec)   total time (hh:mm:ss)
0 700 62.16 239.00 40.00 12:06:12
1 500 84.00 455.00 13.00 11:40:04
2 500 56.84 499.00 09.00 07:53:38
3 500 81.12 491.00 12.00 11:16:00
4 500 234.62 499.00 33.00 32:35:09
5 500 165.38 493.00 30.00 22:58:12
6 500 57.06 333.00 10.00 07:55:32
7 500 64.11 495.00 12.00 08:54:15
8 200 82.14 552.00 68.00 04:33:47
total 4400 98.60 552.00 09.00 119:52:49
TRECVID 2016 8
TRECVID 2016 9
Samples of captions (annotator A vs. annotator B)

A: a dog jumping onto a couch
B: a dog runs against a couch indoors at daytime

A: in the daytime, a driver let the steering wheel of car and slip on the slide above his car in the street
B: on a car on a street the driver climb out of his moving car and use the slide on cargo area of the car

A: an asian woman turns her head
B: an asian young woman is yelling at another one that poses to the camera

A: a woman sings outdoors
B: a woman walks through a floor at daytime

A: a person floating in a wind tunnel
B: a person dances in the air in a wind tunnel
Run Submissions & Evaluation Metrics
•  Up to 4 runs per set (for A and for B) were allowed in
the Matching & Ranking subtask.
•  Up to 4 runs in the Description Generation subtask.
•  Mean inverted rank was used to score the Matching & Ranking
subtask (a minimal computation sketch follows this slide).
•  Machine Translation metrics including BLEU (BiLingual
Evaluation Understudy) and METEOR (Metric for
Evaluation of Translation with Explicit Ordering) were
used to score the Description Generation subtask.
•  An experimental “Semantic Textual Similarity” metric
(STS) was also tested.
10TRECVID 2016
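The mean inverted (reciprocal) rank referred to above can be computed as sketched below. This is an illustrative sketch, not the official NIST scoring script; the input format and variable names are assumptions.

```python
def mean_inverted_rank(ranked_lists, ground_truth):
    """ranked_lists: {video_id: [caption_id, ...]}   ground_truth: {video_id: caption_id}"""
    total = 0.0
    for video_id, truth in ground_truth.items():
        ranking = ranked_lists.get(video_id, [])
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)   # ranks are 1-based
        # a missing correct caption contributes 0 to the sum
    return total / len(ground_truth)

# Example: correct caption at rank 1 for v1 and rank 4 for v2 -> (1 + 0.25) / 2 = 0.625
print(mean_inverted_rank({"v1": ["c3", "c7"], "v2": ["c9", "c1", "c4", "c2"]},
                         {"v1": "c3", "v2": "c2"}))
```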
BLEU and METEOR
•  BLEU [0..1] is used in MT (Machine Translation) to evaluate the
quality of text. It approximates human judgement at the corpus level.
•  Measures the fraction of N-grams (up to 4-grams) shared between
the system output and the reference text.
•  N-gram matches for a high N (e.g., 4) rarely occur at sentence
level, so BLEU@N performs poorly when comparing individual
sentences; it works better on paragraphs or larger units.
•  Often we see B@1, B@2, B@3, B@4 … we use B@4 (see the sketch
after this slide).
•  Heavily influenced by the number of references available.
TRECVID 2016 11
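A hedged illustration of corpus-level BLEU@4 using NLTK; the tokenisation and smoothing choices here are assumptions, and this is not the official TRECVID scoring pipeline.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "is", "licking", "its", "nose"]]]        # one reference list per video
hypotheses = [["a", "dog", "licks", "its", "nose"]]                   # one system caption per video

score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),                 # BLEU@4
                    smoothing_function=SmoothingFunction().method1)   # avoids zero 4-gram counts
print(round(score, 4))
```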
METEOR
•  METEOR computes unigram precision and recall, extending exact
word matches to similar words based on WordNet synonyms and
stemmed tokens.
•  Based on the harmonic mean of unigram precision and recall, with
recall weighted higher than precision.
•  This is an active area … CIDEr (Consensus-Based Image
Description Evaluation) is another recent metric … there is no
universally agreed metric (an NLTK-based sketch follows this slide).
TRECVID 2016 12
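A minimal METEOR illustration using NLTK's implementation (WordNet synonym and stemming matches are handled internally). This is a stand-in sketch, not the scorer actually used for the task; note that recent NLTK versions expect tokenised input.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # needed once for WordNet synonym matching

reference = ["a", "dog", "is", "licking", "its", "nose"]
hypothesis = ["a", "dog", "licks", "its", "nose"]

print(round(meteor_score([reference], hypothesis), 3))
```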
UMBC STS measure [0..1]
•  We’re exploring STS – based on distributional similarity
and Latent Semantic Analysis (LSA) … complemented
with semantic relations extracted from WordNet (a rough
illustrative stand-in is sketched after this slide)
TRECVID 2016 13
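The UMBC STS system itself is a service built on LSA and WordNet relations; as a rough stand-in only, the sketch below scores similarity as the cosine between averaged word vectors. load_word_vectors() is a hypothetical helper standing in for any pretrained embedding table; it is not part of the UMBC system.

```python
import numpy as np

def sentence_similarity(sent_a, sent_b, vectors):
    """Cosine similarity between mean word vectors of two sentences, clipped to [0, 1]."""
    def embed(sentence):
        words = [w for w in sentence.lower().split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0)   # assumes at least one known word

    a, b = embed(sent_a), embed(sent_b)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, cos)

# vectors = load_word_vectors()   # hypothetical: any {word: np.ndarray} embedding table
# print(sentence_similarity("a dog licks its nose", "a dog is licking its nose", vectors))
```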
Participants (7 out of 11 teams finished)

Team                    Matching & Ranking   Description Generation
DCU                     ✓                    ✓
INF(ormedia)            ✓                    ✓
Mediamill (AMS)         ✓                    ✓
NII (Japan + Vietnam)   ✓                    ✓
Sheffield_UETLahore     ✓                    ✓
VIREO (CUHK)            ✓                    –
Etter Solutions         ✓                    –

Total of 46 Matching & Ranking runs; total of 16 Description Generation runs.
14TRECVID 2016
Task 1: Matching & Ranking
15TRECVID 2016
Person reading newspaper outdoors at daytime
Three men running in the street at daytime
Person playing golf outdoors in the field
Two men looking at laptop in an office
x 2000 videos, x 2000 type A descriptions … and ... x 2000 type B descriptions
Matching & Ranking results by run
16TRECVID 2016
[Bar chart: mean inverted rank (y-axis, 0 to 0.12) for each of the 46 submitted runs (x-axis), coloured by team: MediaMill, Vireo, Etter, DCU, INF(ormedia), NII, Sheffield.]
Matching & Ranking results by run
17TRECVID 2016
[Same bar chart as the previous slide: mean inverted rank (y-axis, 0 to 0.12) per submitted run, coloured by team: MediaMill, Vireo, Etter, DCU, INF(ormedia), NII, Sheffield.]
‘B’ runs (coloured per team) seem to be doing better than ‘A’.
Runs vs. matches
18TRECVID 2016
[Chart: matches not found by runs (y-axis, 0–900) vs. number of runs that missed a match (x-axis, 1–10).]
All matches were found by different runs; 5 runs didn’t find any of the 805 matches.
Matched ranks frequency across all runs
19TRECVID 2016
[Two histograms, set ‘B’ and set ‘A’: number of matches (y-axis, 0–800) by rank 1–100 (x-axis).]
Very similar rank distribution across the two sets.
20TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 10,000, log scale) for the top 10 ranked & matched videos in set A; video IDs: 626, 1816, 1339, 1244, 1006, 527, 1201, 1387, 1271, 324.]
21TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 1,000, log scale) for the top 3 ranked & matched videos in set A: 324 (top 1), 1271 (top 2), 1387 (top 3).]
Samples of top 3 results (set A)
TRECVID 2016 22
#1271
a woman and a man are kissing each other
#1387
a dog imitating a baby by crawling on the floor
in a living room
#324
a dog is licking its nose
23TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 10,000, log scale) for the bottom 10 ranked & matched videos in set A; video IDs: 220, 732, 1171, 481, 1124, 579, 754, 443, 1309, 1090.]
24TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 10,000, log scale) for the bottom 3 ranked & matched videos in set A: 220, 732, 1171.]
Samples of bottom 3 results (set A)
TRECVID 2016 25
#1171
3 balls hover in front of a man
#220
2 soccer players are playing rock-paper-scissors
on a soccer field
#732
a person wearing a costume and holding a chainsaw
26TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 10,000, log scale) for the top 10 ranked & matched videos in set B; video IDs: 1128, 40, 374, 752, 955, 777, 1366, 1747, 387, 761.]
27TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 10,000, log scale) for the top 3 ranked & matched videos in set B: 1747, 387, 761.]
Samples of top 3 results (set B)
TRECVID 2016 28
#761
White guy playing the guitar in a room
#387
An Asian young man sitting is eating something yellow
#1747
a man sitting in a room is giving baby something
to drink and it starts laughing
29TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 10,000, log scale) for the bottom 10 ranked & matched videos in set B; video IDs: 1460, 674, 79, 345, 1475, 605, 665, 414, 1060, 144.]
30TRECVID 2016
Videos vs. Ranks
[Chart: videos vs. ranks (1 to 10,000, log scale) for the bottom 3 ranked & matched videos in set B: 414, 1060, 144.]
Samples of bottom 3 results (set B)
TRECVID 2016 31
#144
A man touches his chin in a tv show
#1060
A man piggybacking another man outdoors
#414
a woman is following a man walking on the street at
daytime trying to talk with him
Lessons Learned ?
•  Can we say something about A vs. B?
•  At the top end we’re not so bad … the best results can find
the correct caption within roughly the top 1% of the ranking
TRECVID 2016 32
Task 2: Description Generation
TRECVID 2016 33
“a dog is licking its nose”
Given a video
Generate a textual description
Metrics
•  Popular MT measures: BLEU, METEOR
•  Semantic textual similarity measure (STS)
•  All runs and the ground truth (GT) were normalized (lowercasing,
punctuation and stop-word removal, stemming) before evaluation
with the MT metrics (not applied for STS); a sketch of this
normalization follows this slide
Who? What? Where? When?
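A sketch of the normalisation described above (lowercasing, punctuation and stop-word removal, stemming) as it might be applied to captions before BLEU/METEOR scoring; the exact stop-word list and stemmer used by the organisers are not specified here, so NLTK defaults are assumed.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
_stop = set(stopwords.words("english"))
_stem = PorterStemmer()

def normalize(caption):
    """Lowercase, strip punctuation, drop stop words, and stem the remaining tokens."""
    caption = caption.lower().translate(str.maketrans("", "", string.punctuation))
    return [_stem.stem(tok) for tok in caption.split() if tok not in _stop]

print(normalize("A dog is licking its nose, indoors at daytime."))
# roughly -> ['dog', 'lick', 'nose', 'indoor', 'daytim']
```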
BLEU results
[Bar chart: overall BLEU score per system (y-axis, 0 to 0.025); systems: INF(ormedia), Sheffield, NII, MediaMill, DCU.]
TRECVID 2016 34
BLEU stats sorted by median value
[Chart: min, max, and median BLEU score per run across the 2000 videos (y-axis, 0 to 1.2), runs sorted by median value.]
TRECVID 2016 35
METEOR results
[Bar chart: overall METEOR score per system (y-axis, 0 to 0.3); systems: INF(ormedia), Sheffield, NII, MediaMill, DCU.]
TRECVID 2016 36
METEOR stats sorted by median value
[Chart: min, max, and median METEOR score per run across the 2000 videos (y-axis, 0 to 1.2), runs sorted by median value.]
TRECVID 2016 37
Semantic Textual Similarity (STS) sorted
by median value
[Chart: min, max, and median STS score per run across the 2000 videos (y-axis, 0 to 1.2), runs sorted by median value; runs shown include Mediamill(A), INF(A), Sheffield_UET(A), NII(A), DCU(A) and the corresponding ‘B’ runs.]
‘A’ runs seem to be doing better than ‘B’.
TRECVID 2016 38
STS(A, B) Sorted by STS value
TRECVID 2016 39
An example from run submissions
– 7 unique examples
1.  a girl is playing with a baby
2.  a little girl is playing with a dog
3.  a man is playing with a woman in a
room
4.  a woman is playing with a baby
5.  a man is playing a video game and
singing
6.  a man is talking to a car
7.  A toddler and a dog
TRECVID 2016 40
Participants
•  High-level descriptions of what groups did, taken from their
papers … more details on the posters
TRECVID 2016 41
Participant: DCU
Task A: Caption Matching
•  Preprocess 10 frames/video to detect 1,000 objects
(VGG-16 CNN from ImageNet), 94 crowd behaviour
concepts (WWW dataset), locations (Place2 dataset on
VGG16)
•  4 runs: a BM25 baseline, Word2vec, and fusion (a rough BM25
matching sketch follows this slide)
Task B: Caption Generation
•  Train on MS-COCO using NeuralTalk2, an RNN
•  One caption per keyframe, captions then fused
TRECVID 2016 42
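A rough sketch of BM25-based caption matching of the kind DCU's baseline run describes, using the rank_bm25 package; the detector output and the direction of indexing (captions as documents, detected concepts as the query) are illustrative assumptions, not DCU's actual implementation.

```python
from rank_bm25 import BM25Okapi

captions = ["a dog jumping onto a couch",
            "a person dances in the air in a wind tunnel",
            "two men looking at laptop in an office"]
tokenized_captions = [c.lower().split() for c in captions]
bm25 = BM25Okapi(tokenized_captions)          # index the candidate captions

# Hypothetical detector output for one video: detected object/scene concept labels.
video_concepts = ["dog", "couch", "living", "room"]

scores = bm25.get_scores(video_concepts)      # one BM25 score per candidate caption
ranked = sorted(zip(captions, scores), key=lambda x: -x[1])
print(ranked[0])                              # best-matching caption for this video
```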
Participant: Informedia
Focus on the generalization ability of caption models, ignoring the
Who, What, Where, When facets
Trained 4 caption models on 3 datasets (MS-COCO, MS-VD,
MSR-VTT), achieving state-of-the-art results with those models,
based on VGGNet concepts and a Hierarchical Recurrent Neural
Encoder for temporal aspects
Task B: Caption Generation
•  Results explore transfer models to TRECVid-VTT
TRECVID 2016 43
Participant: MediaMill
Task A: Caption Matching
Task B: Caption Generation
TRECVID 2016 44
Participant: NII
Task A: Caption Matching
•  3DCNN for video representation trained on MSR-VTT +
1970 YouTube2Text + 1M captioned images
•  4 run variants submitted, concluding that the approach did not
generalise well on the test set and suffered from over-fitting
Task B: Caption Generation
•  Trained on 6500 videos from MSR-VTT dataset
•  Confirmed that multimodal feature fusion works best, with
audio features surprisingly good
TRECVID 2016 45
Participant: Sheffield / Lahore
Task A: Caption Matching
Submitted runs
Task B: Caption Generation
•  Identified a variety of high level concepts for frames
•  Detect and recognize faces, age and gender, emotion,
objects, (human) actions
•  Varied the frequency of frames for each type of
recognition
•  Runs based on combinations of feature types
TRECVID 2016 46
Participant: VIREO (CUHK)
Adopted their zero-example MED system in reverse
Used a concept bank of 2000 concepts trained on MSR-
VTT, Flickr30k, MS-COCO and TGIF datasets
Task A: Caption Matching
•  4 (+4) runs testing a traditional concept-based approach vs.
attention-based deep models, finding that the deep models perform
better and that motion features dominate performance
TRECVID 2016 47
Participant: Etter Solutions
Task A: Caption Matching
•  Focused on concepts for Who, What, When, Where
•  Used a subset of ImageNet plus scene categories from
the Places database
•  Applied the concept detectors at 1 fps (frame per second) with a
sliding window, mapped the detections to a “document” vector, and
calculated a similarity score against each caption (a rough sketch
follows this slide)
TRECVID 2016 48
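A minimal sketch of the "document vector" similarity idea described above: per-frame concept scores (sampled at 1 fps) are pooled into a single video vector and compared with a caption represented over the same concept vocabulary via cosine similarity. The concept list, detector scores, and max-pooling choice are illustrative assumptions, not Etter's actual configuration.

```python
import numpy as np

concepts = ["man", "dog", "guitar", "street", "indoor", "daytime"]

def video_vector(per_frame_scores):
    """Pool per-frame concept scores (frames x concepts) into one video 'document' vector."""
    return np.max(np.asarray(per_frame_scores), axis=0)   # max-pool over the 1 fps frames

def caption_vector(caption):
    """Bag-of-concepts representation of a caption over the same concept vocabulary."""
    words = caption.lower().split()
    return np.array([1.0 if c in words else 0.0 for c in concepts])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

frames = [[0.1, 0.0, 0.9, 0.2, 0.8, 0.1],   # hypothetical detector scores, frame 1
          [0.7, 0.0, 0.8, 0.1, 0.9, 0.0]]   # frame 2
print(cosine(video_vector(frames), caption_vector("a man playing the guitar indoors")))
```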
Observations
•  Good participation and a good finishing percentage; ‘B’ runs did better than ‘A’ in matching &
ranking, while ‘A’ did better than ‘B’ in semantic similarity.
•  METEOR scores are higher than BLEU; we should have used CIDEr as well (some
participants did).
•  STS as a metric raises some questions: what makes more sense, MT metrics or semantic
similarity? Which metric measures real system performance in a realistic application?
•  Lots of available training sets, with some overlap ... MSR-VTT, MS-COCO, Place2,
ImageNet, YouTube2Text, MS-VD ... some annotated with AMT (MSR-VTT-10k has
10,000 videos, 41.2 hours, and 20 annotations each!)
•  What did individual teams learn?
•  Do we need more reference (GT) sets? (good for MT metrics)
•  Should we run again as a pilot? How many videos to annotate, and how many
annotations for each?
•  Only some systems applied the 4-facet description in their submissions.
TRECVID 2016 49
Observations
•  There are other video-to-caption challenges, such as the ACM
MULTIMEDIA 2016 Grand Challenges
•  Images from YFCC100M with captions, in a caption-matching/
prediction task for 36,884 test images. The majority of participants
used CNNs and RNNs
•  Video: MSR-VTT with 41.2 hours, 10,000 clips, each with 20 AMT
captions … evaluation measures are BLEU, METEOR, CIDEr and
ROUGE-L ... Grand Challenge results do not get aggregated and
dissipate at the ACM MM Conference, so progress is hard to gauge.
TRECVID 2016 50
