Opening up audiovisual archives for media professionals and researchers

Opening up Audiovisual Archives for Professionals and Researchers

Tinne Tuytelaars

07/1/2014

www.axes-project.eu

Overview

•  The AXES project
•  Audiovisual content analysis
•  Demo systems:
   –  AXES-PRO
   –  AXES-RESEARCH
•  Lessons learned, future work

AXES OVERVIEW, Hilversum, 31/1/2013


Current situation

•  A huge wealth of audiovisual data
•  Lots of data in archive virtually irretrievable
•  Search/indexing based on metadata manually entered by the archivists


Some causes of irretrievable data

1.  Old material from pre-digital era
2.  Metadata provided at document-level
3.  Ideal annotations do not exist
    –  Different uses
    –  Different users


Developing tools providing new engaging ways to interact with audiovisual libraries…

A user-centered approach

Our aim is to open up audiovisual digital libraries, increasing their cul[…] an early stage. Targeted users are media professionals, educators, students, amateur researchers and home users.

Not just search:

Browse
Explore
Experience


... for a multitude of end users...

Media professionals (Y1-Y2)
Researchers (Y2-Y3)
Home users (Y3-Y4)


… building on state-of-the-art content analysis techniques…

Computer vision
Speech recognition / audio analysis
Search & navigation
Weakly-supervised methods

Figure 1. An overview of our processing pipeline. (a) A face detector is applied to each video frame. (b) Face tracks are created by associating face detections. (c) Facial points are localized. (d) Local SIFT appearance descriptors are extracted on the facial features, and concatenated to form the final face descriptor.

We then extract SIFT descriptors at these nine locations at three different scales, which we concatenate to form a feature vector f ∈ ℝ^D of dimension D = 3 × 9 × 128 = 3456. As the descriptors are computed at facial feature points, it is robust to pose and expression changes. Using the SIFT descriptor makes it also robust to small errors in localization.

3.2. Metric learning from face tracks

Given a set of face tracks we can extract face pairs from them to learn a metric over the face descriptors in an unsu[…]
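The dimension quoted in the excerpt can be checked with a small sketch. The constants come from the excerpt; the function name and the dummy zero-valued descriptors are illustrative assumptions and stand in for real SIFT output.

```python
# Constants taken from the excerpt shown on the slide.
NUM_LANDMARKS = 9   # facial feature points
NUM_SCALES = 3      # scales at which each point is described
SIFT_DIM = 128      # length of one SIFT descriptor


def concatenate_face_descriptor(per_landmark_descriptors):
    """Flatten a [landmark][scale] grid of SIFT vectors into one face vector."""
    face_vector = []
    for scales in per_landmark_descriptors:
        for sift in scales:
            face_vector.extend(sift)
    return face_vector


# Dummy descriptors (all zeros) replace real SIFT output: an assumption.
dummy = [[[0.0] * SIFT_DIM for _ in range(NUM_SCALES)]
         for _ in range(NUM_LANDMARKS)]
face_vector = concatenate_face_descriptor(dummy)
print(len(face_vector))  # 3 * 9 * 128 = 3456
```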


… building on state-of-the-art content analysis techniques…

People
Places
Events
Object categories


… with a user-centric approach…


… working along 3 axes:

•  Users
•  Technology
•  Content



Overview

•  The AXES project
•  Audiovisual content analysis
•  Demo systems:
   –  AXES-PRO
   –  AXES-RESEARCH
•  Lessons learnt, future work



Processing pipeline

Multimodal query
Query Language
LIMAS
Filter
Sigmoid normalize
Category Search
Instance Search
Person Clustering
Face Search
Similarity Search
Multimedia Event Detection
Face Similarity Search
Pose Retrieval
ASR/Meta
…


Processing pipeline

Data Gathering
Metadata Normalisation
Video Normalisation
Shot Extraction
Subtitle Normalisation
Face Detection and Tracking
Speech Processing
Merge
Named Entity Extraction
Data Exposer
VISOR Indexer
Link Management
Place Indexer
Face Indexer


On-the-fly Visual Category Search

Pre-computed features (NISV/TRECVID)
~100 positive (‘present’) training images
~1000 negative (‘not present’) training images
Visual Model

Training of visual model from multiple images allows more general visual searches to be made, e.g. all motorbikes and not just Harleys
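As a rough illustration of the idea on this slide, the sketch below trains a linear classifier from a few positive and negative feature vectors and uses it to rank a corpus. The logistic-regression training, the 2-D toy features and all names are assumptions made for the example; the slides do not state which classifier or features the AXES system uses for category search.

```python
import math


def train_linear_classifier(positives, negatives, epochs=200, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent.

    The returned list holds one weight per feature, followed by the bias.
    """
    dim = len(positives[0])
    w = [0.0] * (dim + 1)
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                grad[i] += (p - y) * x[i]
            grad[-1] += p - y
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def rank_corpus(w, corpus):
    """Return corpus ids sorted by decreasing classifier score."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return sorted(corpus, key=lambda key: score(corpus[key]), reverse=True)


# Toy stand-ins for pre-computed image features (hypothetical values).
positives = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0]]               # 'present'
negatives = [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8], [0.1, 0.7]]   # 'not present'
corpus = {"shot_a": [0.95, 0.05], "shot_b": [0.05, 0.95], "shot_c": [0.5, 0.5]}

weights = train_linear_classifier(positives, negatives)
ranking = rank_corpus(weights, corpus)
print(ranking)
```

Because the corpus features are computed ahead of time, only the small training step and a dot product per shot are needed at query time.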


OTF System Outline


Visual Category Search – Examples

‘forest’ – NISV
‘gothic architecture’ – NISV


On-the-fly Face Search

ON-LINE PROCESSING

Text Query “Courteney Cox”

Negative Training Images

Facial Features & Descriptors

Google Image Search “Courteney Cox”

OFF-LINE PROCESSING

Facial Features & Descriptors

Video Search Corpus
Fast Linear Classifier

Ranking

Face Tracks
Facial Features & Descriptors

Results


Face Search – Feature Computation

Face detection and facial landmark detection
Feature region descriptors
Faces clustered into tracks
Concatenation
Feature Vector


On-the-fly Face Search

Facial landmark detection
Aligned and cropped face
Dense SIFT, GMM, and FV
Discriminative dim. reduction
Compact face representation


New Face Descriptor

Face descriptor for face recognition

•  dense sampling
•  relevant face parts learnt automatically
•  compact and discriminative

Current approach (describe landmarks)
New approach (describe everything)


Face Category Search – Examples

‘George Bush’ – NISV
‘Spectacles’ – NISV


On-the-fly Instance Search

Text Query “Earth”

ON-LINE PROCESSING

OFF-LINE PROCESSING

Precomputed descriptors for the entire corpus
Descriptors

Google Image Search “Earth”

Inverted index for fast search

Fast search

Reranking based on the geometric configuration of matched descriptors

Results
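The "inverted index for fast search" box can be sketched as follows. The example assumes that local descriptors have already been quantized into integer visual-word ids, a common bag-of-visual-words setup that the slides do not spell out; the frame ids, word ids and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(corpus_words):
    """Map each visual-word id to the set of keyframes that contain it."""
    index = defaultdict(set)
    for frame_id, words in corpus_words.items():
        for word in words:
            index[word].add(frame_id)
    return index


def search(index, query_words):
    """Rank frames by the number of distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        for frame_id in index.get(word, ()):
            votes[frame_id] += 1
    return [frame_id for frame_id, _ in votes.most_common()]


# Off-line: index the whole corpus once (toy data).
corpus_words = {
    "frame_1": [3, 7, 7, 42],
    "frame_2": [7, 8],
    "frame_3": [1, 2],
}
index = build_inverted_index(corpus_words)

# On-line: only frames sharing at least one word with the query are touched.
results = search(index, [3, 7, 42])
print(results)
```

The point of the structure is that query cost depends on the frames that share words with the query, not on the size of the whole corpus.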


Instance Search: Feature Computation

Salient image regions are detected and a descriptor is extracted for each one


Instance Search: Geometric verification

Spatial arrangement of descriptors is used to verify a match
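A minimal sketch of the verification idea, under a strong simplifying assumption: the query and database images are related by a pure translation. Real systems typically fit a richer transformation (for example affine), and the slides do not say which model AXES uses. Each match votes for a translation, and the score is the largest group of matches that agree within a pixel tolerance.

```python
def count_inliers(matches, tolerance=5.0):
    """Return the largest number of matches agreeing on one translation.

    Each match is ((xq, yq), (xd, yd)): a keypoint in the query image and
    the keypoint it was matched to in the database image.
    """
    best = 0
    for (xq, yq), (xd, yd) in matches:
        dx, dy = xd - xq, yd - yq  # translation hypothesised by this match
        inliers = 0
        for (uq, vq), (ud, vd) in matches:
            if (abs((ud - uq) - dx) <= tolerance
                    and abs((vd - vq) - dy) <= tolerance):
                inliers += 1
        best = max(best, inliers)
    return best


# Three matches share roughly the same shift; the fourth is an outlier.
consistent = [((10, 10), (110, 60)), ((20, 15), (121, 64)),
              ((30, 40), (130, 91)), ((5, 5), (300, 7))]
# Matches whose shifts disagree: no common geometry.
spurious = [((0, 0), (50, 0)), ((10, 0), (0, 80)), ((0, 10), (200, 200))]

print(count_inliers(consistent), count_inliers(spurious))
```

A candidate with many geometrically consistent matches is kept and ranked higher; one whose matches only agree with themselves is likely a false positive.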


On-the-fly Instance Search - variant

Text Query “Earth”

ON-LINE PROCESSING

OFF-LINE PROCESSING

Precomputed descriptors for the entire corpus
Descriptors
Unsupervised pattern mining
Google Image Search “Earth”

Inverted index for fast search

Fast search
Reranking based on the geometric configuration of matched descriptors

Results


On-the-fly Instance Search - variant


Instance Search – Examples

‘Golden Gate Bridge’ – TRECVid 2012 INS
‘Red Bull’ – TRECVid 2012 INS


Instance Search – Examples

‘US Army Corps of Engineers logo’ – TRECVid 2012 INS
‘Eiffel tower’ – TRECVid 2012 INS


Instance Search – Examples

‘US Capitol’ – TRECVid 2012 INS
‘Obama logo’ – TRECVid 2012 INS


Pretrained classifiers

•  On-the-fly is very powerful, but takes time
•  Compromise between speed and accuracy
•  For common queries: better use models that have been trained offline
•  Cache results
•  ImageNet categories, celebrities, …
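One possible way to combine the two regimes is sketched below: look up a model trained offline first, fall back to on-the-fly training only for unseen queries, and cache whatever was trained. The function names and the string "models" are placeholders for illustration, not part of the AXES code.

```python
def make_model_lookup(pretrained_models, train_on_the_fly):
    """Return a lookup that prefers offline models and caches new ones."""
    cache = dict(pretrained_models)

    def lookup(query):
        key = query.strip().lower()
        if key not in cache:
            # Slow path: only reached the first time a query is seen.
            cache[key] = train_on_the_fly(key)
        return cache[key]

    return lookup


training_calls = []


def slow_training_stub(query):
    """Stand-in for on-the-fly training; records how often it is called."""
    training_calls.append(query)
    return "on-the-fly model for " + query


lookup = make_model_lookup(
    {"motorbike": "offline model for motorbike"}, slow_training_stub)

first = lookup("Motorbike")             # served by the offline model
second = lookup("gothic architecture")  # trained on the fly, then cached
third = lookup("Gothic Architecture")   # served from the cache
print(training_calls)
```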


Multimedia Event detection

Short action, e.g. answer phone, hand shake

answer phone
hand shake


Multimedia Event detection

•  Activities/events, e.g. birthday party, parade

Birthday party
Parade

TrecVid Multi-media event detection dataset




Similarity search: Faces

–  Face detection in query image (or user-assisted face selection)
–  Landmark detection and HOG descriptor computation
–  Matching using a metric learned offline

Query
Database
Answer
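The matching step can be illustrated with a Mahalanobis-style distance d(x, y) = (x − y)ᵀ M (x − y), where M would be learned offline. The form of the metric, the hand-picked matrix M and the 2-D toy descriptors are assumptions for the sketch; the slide only states that a learned metric is used on HOG descriptors.

```python
def metric_distance(x, y, M):
    """Squared distance (x - y)^T M (x - y) under the metric matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * M[i][j] * diff[j]
               for i in range(n) for j in range(n))


def nearest_faces(query, database, M, k=1):
    """Return the ids of the k database faces closest to the query."""
    ranked = sorted(database,
                    key=lambda name: metric_distance(query, database[name], M))
    return ranked[:k]


# Toy descriptors (hypothetical values).
query = [0.0, 0.0]
database = {"face_a": [0.5, 0.0], "face_b": [0.0, 3.0]}

identity = [[1.0, 0.0], [0.0, 1.0]]   # plain Euclidean distance
learned = [[1.0, 0.0], [0.0, 0.01]]   # placeholder: down-weights dimension 2

print(nearest_faces(query, database, identity))
print(nearest_faces(query, database, learned))
```

With the identity matrix the answer is face_a; with the placeholder metric, which treats variation along the second dimension as unimportant, the answer becomes face_b. This is the effect a learned metric is meant to have for nuisance factors such as pose or lighting.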

Meeting in Leuven, 17-18 Oct. 2013


AXES PRO


User Requirements

PRO vs. RESEARCH

•  Browsing and Linking
•  Tagging and Annotation
•  Search trails
•  Metadata export
•  Customization and personalization
•  Collaboration
•  Usage statistics
•  Thesaurus


AXES RESEARCH


AXES RESEARCH @ TRECVID

WP7, Bonn, Germany, 15 Oct 2013


TRECVID INS

•  Results
   –  About average in terms of MAP
   –  Good for P@10, P@30, P@100
•  Issues
   –  Duplicated shots from SBD (shot boundary detection)
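For readers less familiar with the measures named above, the sketch below computes precision at k and average precision for one ranked list, using the standard definitions; MAP is the mean of the average precision over all queries. The shot ids and relevance judgments are made up for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top = ranked[:k]
    return sum(1 for item in top if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs.

    Relevant items that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical system output and ground truth for one query.
ranked = ["s1", "s2", "s3", "s4", "s5"]
relevant = {"s1", "s3", "s9"}

p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
print(p_at_2, ap)
```

P@k only looks at the top of the list, while average precision also penalises relevant shots that are ranked low or never found. A system can therefore do well on P@10 and still be about average on MAP.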



P@10

[Bar chart: P@10 per submitted TRECVID INS run, axis "P10" from 0.0 to 7.5. Of the 72 runs listed, the AXES runs appear at positions 3 (I_NO_AXES_1_1), 11 (I_NO_AXES_3_3) and 15 (I_NO_AXES_2_2).]


P@30

[Bar chart: P@30 per submitted TRECVID INS run, axis "P30" from 0 to 20. Of the 64 runs listed, the AXES runs appear at positions 2 (I_NO_AXES_1_1), 9 (I_NO_AXES_3_3) and 10 (I_NO_AXES_2_2).]


P@100

[Bar chart: P@100 per submitted TRECVID INS run, axis "P100" from 0 to 60. Of the 64 runs listed, the AXES runs appear at positions 4 (I_NO_AXES_1_1), 23 (I_NO_AXES_2_2) and 24 (I_NO_AXES_3_3).]


MAP

[Bar chart: MAP per submitted TRECVID INS run, axis "mAP" from 0.0 to 0.3. Of the 64 runs listed, the AXES runs appear at positions 19 (I_NO_AXES_1_1), 28 (I_NO_AXES_3_3) and 29 (I_NO_AXES_2_2).]


Time spent by users

[Bar chart: time spent by users per interactive run, axis "time (mins)" from 0 to 15. Runs shown: I_NO_AXES_3_3, I_NO_AXES_2_2, I_NO_AXES_1_1, I_NO_ITI_CERTH_3, I_NO_ITI_CERTH_2, I_NO_ITI_CERTH_1, I_NO_FTRDBJ_4.]


AXES RESEARCH @ TRECVID


TrecVid MED 2012 - results

•  Combination of MBH, SIFT, audio and text recognition
•  Aggregation with Fisher vector
•  First in ad hoc event task, second in known event task

Making sandwich – results

Rank 1

Rank 19

Rank 20

TrecVid MED 2012 - results

FlashMob gathering – results

Rank 1 (pos)

Rank 18 (pos)

Rank 19 (neg)


Lessons learnt

•  On-the-fly model learning gives great flexibility. High potential especially when combined with offline learning / lifelong learning.
•  The more data, the better.
•  It’s important to educate your users.
•  A good user interface is important.
•  Multimodal and spatio-temporal processing: easier said than done.


AXES HOME

What do home users want?

–  Google-like search?
–  Have fun?


Immediate Pose Retrieval

N. Jammalamadaka, A. Zisserman, M. Eichner, V. Ferrari, C. V. Jawahar
Video Retrieval by Mimicking Poses
ACM International Conference on Multimedia Retrieval, 2012


Examples


AXES HOME

What do home users want?

–  Google-like search?
–  Have fun?
–  Browse?


Linking

•  What are interesting links?
•  Semantics, semantics, semantics
•  Not too similar, not too different


AXES HOME

What do home users want?

–  Google-like search?
–  Have fun?
–  Browse?
–  Suggest content to user based on his/her behavior?


Questions?

