TAAI 2016 Keynote Talk: It is all about AI

It is all about AI
Mark Liao
Institute of Information Science
Academia Sinica, Taiwan
(TAAI 2016)

Contents of this talk
• Automatic Concert Video Mashup
• Spatio-Temporal Learning of Basketball
Offensive Strategies

Automatic Concert Video
Mashup

What is concert video mashup?
• Concert video mashup takes all the videos captured from
different locations in a concert hall and converts them
into a single complete, non-overlapping, seamless,
high-quality video.

Why concert video mashup?
• To give people who could not attend the live concert a
second chance to enjoy the performance at similar
quality.

Many problems to be solved!
• The videos are captured without coordination,
so gaps and redundancy are common.
• The order in which the videos should be watched
is often unclear.
• The videos are captured with handheld
devices, so their visual and audio quality cannot be
guaranteed.

Issues that need to be addressed
• The order to watch
• Visual quality optimization
• Seamless sound track connection
• No redundancy
• No missing video segments
• Mashup results that follow the rules defined by the
language of film

Potential Issues: The order to watch (1/5)
• Three video clips captured from three different
angles and distances; clips 1 and 2 partially
overlap, clip 3 is independent
[Figure: clips 1, 2, and 3]

Potential Issues: Multiple audio sequence alignment (2/5)
Case 1: partially overlapped
Case 2: no overlap

Potential Issues (3/5)
• Among three videos that are coherent in time, which
one should be chosen? (3 different locations)
-- Follow the rules of the language of film!
[Figure: medium shot, long shot, extreme long shot]

Potential Issues (4/5)
• Among several qualified video clips shot from the same
distance, which one should be chosen?
-- Visual quality? Audio quality?
[Figure: two extreme long shots]

Potential Issues (5/5)
• How can the emotion, ideas, and artistry of
a music director be brought into the concert video mashup
process? Can a CNN learn facial emotion?

Previous Work
• The research area closest to “automatic video
mashup” is “summarization of multi-view
videos”.
• The objective of the latter is to produce a
reduced set of abstracted videos or key-frame
sequences that represent the most
prominent parts of the input videos.

Literature related to video mashup (1/3)
• [Shrestha et al.] formulate video mashup as an
optimization problem
- Pros: optimizes visual quality and
diversity constraints
- Cons: does not take into account the
professional view of a visual storytelling director
P. Shrestha et al., “Automatic mashup generation from multiple-camera
concert recordings,” ACM MM, 2010.

Literature related to video mashup (2/3)
• [Wu et al.] apply pre-defined rules to solve
the frequent-shot-change problem
- Pros: solves part of the shot-change
problem
- Cons: does not involve a visual storytelling
director to guide the video mashup process
Wu et al., “MoVieUp: Automatic mobile video mashup,” IEEE TCSVT,
2015.

Literature related to video mashup (3/3)
• [Saini et al.] introduce visual storytelling rules by dividing
the audience seats into six shooting locations and then
calculating statistics of shot transitions and lengths from
professionally edited videos
- Pros: a good start at introducing the views of
professional experts
- Cons: the shot types are self-defined rather than taken from the rules
defined in the language of film
Saini et al., “MoViMash: Online mobile video mashup,” ACM MM, 2012.

Introduction
• Experienced movie directors frequently use established
camera work practices in visual storytelling.
[Figure: song structure: Intro, Verse, Chorus, Bridge, ...]

Introduction
• Applications
– Mashup
– Emotion (music video)

Introduction
• According to the language of film [3], shot size
is one of the basics of filmmaking.
[Figure: long shot, close-up]

Introduction
• The definition of six
types of shots [3].

Introduction
• Following the definitions in the
language of film [3], a
concert video contains
eight types of camera
shots.
[Figure: Musical Instrument Shot (MIS), Audience Shot (ADS)]

Introduction
• Two images from an official concert video of the song “93
Million Miles” by Jason Mraz, live in Hong Kong, 2012.

System Framework for Video Mashup

Shot Classification based on
EW-Deep-CCM
• Error-Weighted Deep Cross-Correlation Model

Object Representation (VGG-Net)
• Object representation using a 16-layer VGG-Net
• We extract features from the output layer and the two fully-connected
layers as the object representations; the feature
dimensions are 1000-D, 4096-D, and 4096-D, respectively.

Object Representations (1/2)
ImageNet1000
object representation

Object Representations (2/2)

Literature related to Fusion Strategy
• Early fusion
– Pros:
Takes advantage of combining various feature cues
– Cons:
The high-dimensional feature set may easily suffer from
data sparseness and strain computational
resources.

Literature related to Fusion Strategy
• Late fusion
– Pros:
Does not increase the dimensionality
Makes it possible to interpret the performance of the different classifiers and gain insight
into the role of multiple modalities during emotional expression
– Cons:
The assumption of conditional independence among multiple
modalities is inappropriate.
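The contrast between the two strategies can be sketched in a few lines of Python. This is only an illustration, not the EW-Deep-CCM fusion from the talk: `early_fusion` and `late_fusion` are hypothetical helper names, and averaging the posteriors is just one simple late-fusion rule assumed here.

```python
from typing import List, Sequence


def early_fusion(features: Sequence[Sequence[float]]) -> List[float]:
    """Early fusion: concatenate the per-layer feature vectors into one vector."""
    fused: List[float] = []
    for vec in features:
        fused.extend(vec)
    return fused


def late_fusion(posteriors: Sequence[Sequence[float]]) -> List[float]:
    """Late fusion (assumed rule): average the class posteriors of separate classifiers."""
    n_classes = len(posteriors[0])
    avg = [sum(p[c] for p in posteriors) / len(posteriors) for c in range(n_classes)]
    total = sum(avg)
    return [a / total for a in avg]  # renormalise so the scores sum to 1
```

Concatenating the 1000-D, 4096-D, and 4096-D representations gives a 9192-D vector, which shows the dimensionality cost of early fusion; late fusion keeps each classifier in its own feature space and combines only the class scores.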

Shot Classification based on
EW-Deep-CCM
• A novel fusion strategy, the Error-Weighted Deep Cross-Correlation
Model (EW-Deep-CCM), is proposed to effectively combine the extracted
multilayer object representations.

Experimental Results
• Comparison of shot type classification (other
methods)

• EW-Deep-CCM achieves only an 83% detection
rate
• A 17% error remains, i.e., about a 1/6 error rate, which
causes frequent shot changes

17% error rate causes too many
shot changes

Conditional Random Field (CRF)-based
Approach
• 1st trial: fixed 30-frame window (not a systematic way to
smooth the results)
• 2nd trial: Recurrent Neural Network (RNN)
-- Problem: an RNN needs pre-segmented data to give its best results,
but the generated shot type classification results are not well
segmented
• 3rd trial: Conditional Random Field (CRF)
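The first trial can be sketched as a block-wise majority vote. This is a minimal sketch under an assumption: the talk only says "30-frame fixed window", so the non-overlapping blocks and the helper names `majority_smooth` and `count_shot_changes` are illustrative choices, not the author's implementation.

```python
from collections import Counter
from typing import List, Sequence


def majority_smooth(labels: Sequence[str], window: int = 30) -> List[str]:
    """Replace every non-overlapping block of `window` frame labels by its majority label."""
    smoothed: List[str] = []
    for start in range(0, len(labels), window):
        block = labels[start:start + window]
        winner = Counter(block).most_common(1)[0][0]
        smoothed.extend([winner] * len(block))
    return smoothed


def count_shot_changes(labels: Sequence[str]) -> int:
    """Count the positions where the label differs from the previous frame."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)
```

A single misclassified frame inside a block is voted away, but block boundaries are fixed and ignore the true shot boundaries, which is one way to read the slide's remark that this is not a systematic smoother.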

Our Method – Coherent-Net
[Figure: Coherent-Net with CRF-based shot type refinement]

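Our Method – Coherent-Net Framework
\[
\begin{aligned}
P(\mathbf{w} \mid \mathbf{O}) &= \sum_{\mathbf{w}'} P(\mathbf{w}, \mathbf{w}' \mid \mathbf{O}) \\
&\approx P(\mathbf{w} \mid \mathbf{w}') \cdot P(\mathbf{w}' \mid \mathbf{O}) \\
&\approx \underbrace{P(\mathbf{w} \mid \mathbf{w}')}_{\text{CRF}} \cdot \underbrace{\prod_{n=1}^{N} P(w'_n \mid o_n)}_{\text{EW-Deep-CCM}}
\end{aligned}
\]
[Figure: EW-Deep-CCM outputs P(w'|O); shot type refinement (CRF) applies P(w|w') to give P(w|O)]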

EW-Deep-CCM
[Equation: P(w'|o) ≈ Σ_{i=1..C} Σ_{j=1..D} Σ_{k=1..K} (...), combining likelihood (DNN posterior probability), cross-correlation, and empirical weight terms over the output-layer (out) and fully-connected-layer (fc) representations]
[Figure: EW-Deep-CCM outputs P(w'|O); shot type refinement (CRF) applies P(w|w') to give P(w|O)]

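Shot Type Refinement (CRF)
\[
\mathbf{w}' = w'_1, \ldots, w'_{t-1}, w'_t \qquad \mathbf{w} = w_1, w_2, w_3, \ldots, w_{t-1}, w_t
\]
\[
P(\mathbf{w} \mid \mathbf{w}') = \frac{1}{Z(\mathbf{w}')} \exp\Big( \sum_j F_j(\mathbf{w}, \mathbf{w}') \Big),
\qquad
Z(\mathbf{w}') = \sum_{\mathbf{w}} \exp\Big( \sum_j F_j(\mathbf{w}, \mathbf{w}') \Big)
\]
\[
P(\mathbf{w} \mid \mathbf{w}') \propto \frac{1}{Z(\mathbf{w}')} \exp\Big( \sum_t \sum_j \lambda_j\, t_j(w_{t-1}, w_t, \mathbf{w}') + \sum_t \sum_j \mu_j\, s_j(w_t, \mathbf{w}') \Big)
\]
\[
\propto \frac{1}{Z(\mathbf{w}')} \exp\Big( \sum_t \sum_{m,n \in S} \lambda_{mn}\, \mathbf{1}_{\{w_t = m\}} \mathbf{1}_{\{w_{t-1} = n\}} + \sum_t \sum_{m \in S} \sum_{o \in O} \mu_{om}\, \mathbf{1}_{\{w_t = m\}} \mathbf{1}_{\{w'_t = o\}} \Big)
\]
State-observation pair (unary potential):
\[
s_j(w_t, \mathbf{w}') = \begin{cases} 1, & \text{when } w'_t = o \text{ and } w_t = m \\ 0, & \text{otherwise} \end{cases}
\]
State transition (pairwise potential):
\[
t_j(w_{t-1}, w_t, \mathbf{w}') = \begin{cases} 1, & \text{when } w_{t-1} = n \text{ and } w_t = m \\ 0, & \text{otherwise} \end{cases}
\]
Example: CLCCCC → CCCCCC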
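How a linear-chain CRF of this form cleans up a noisy label sequence can be sketched with Viterbi decoding. This is a sketch under assumptions, not the talk's trained model: the function name `viterbi_refine` is hypothetical, the weights in the usage note are made-up numbers, and decoding picks the highest-scoring sequence w, for which the normaliser Z(w') is not needed because it does not depend on w.

```python
from typing import Dict, List, Sequence, Tuple

Weights = Dict[Tuple[str, str], float]


def viterbi_refine(observed: Sequence[str], states: Sequence[str],
                   lam: Weights, mu: Weights) -> List[str]:
    """Return the label sequence w maximising the CRF score given observations w'.

    lam[(n, m)]: pairwise weight for the transition w_{t-1} = n -> w_t = m.
    mu[(o, m)]:  unary weight for pairing the observation w'_t = o with w_t = m.
    Missing keys count as weight 0.
    """
    if not observed:
        return []
    # best[t][m]: best score of any path that ends in state m at time t
    best = [{m: mu.get((observed[0], m), 0.0) for m in states}]
    back: List[Dict[str, str]] = []
    for obs in observed[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for m in states:
            prev = max(states, key=lambda n: best[-1][n] + lam.get((n, m), 0.0))
            scores[m] = best[-1][prev] + lam.get((prev, m), 0.0) + mu.get((obs, m), 0.0)
            pointers[m] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final state
    state = max(states, key=lambda m: best[-1][m])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With hypothetical weights that reward staying in the same shot type (for example 2.0 for C→C and L→L, -1.0 for a change) and a unary weight of 1.0 when the refined label agrees with the observation, the isolated L in CLCCCC is overruled and the refined sequence is CCCCCC, matching the example above.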

Experiments – Official Demo 1
• The song “Skyfall” by Adele, performed at the Oscars 2013

Experiments – Official Demo 2
• The song “When I Was Your Man” by Bruno Mars,
performed at BBC Radio 1's Big Weekend 2013


Problem & Goal
• A concert video
mashup process needs
to align the videos
taken by different
audience members on a
common timeline.

Literature Review
• Audio fingerprinting
• Problems
– Originally designed for the problem of audio
identification rather than that of time alignment.
– Easily cause audio signal distortion
• Zhu et al. treat audio identification as an image
matching problem (significant performance improvement).
B. Zhu et al., “A novel audio fingerprinting method robust to time
scale modification and pitch shifting,” ACM MM, 2010.

Our Method
• We modified Zhu’s method to address the multiple
audio sequence alignment problem.
– Auditory image (spectrogram) construction:
1-D audio signal (waveform) → short-time Fourier transform → 2-D auditory image
(time-frequency representation, spectrogram)
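The waveform-to-auditory-image step can be sketched with a plain short-time Fourier transform. This is a minimal sketch: the frame length, hop size, and Hann window are assumptions chosen for illustration, and the direct DFT is written for clarity rather than speed.

```python
import cmath
import math
from typing import List, Sequence


def spectrogram(signal: Sequence[float], frame_len: int = 64, hop: int = 32) -> List[List[float]]:
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # Hann window to reduce spectral leakage at the frame edges
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    rows: List[List[float]] = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            row.append(abs(coeff))
        rows.append(row)
    return rows
```

The resulting 2-D array can be treated as a grey-level image, which is what allows image features such as SIFT to be applied to audio.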

Our Method
– Audio sequence alignment
(1) Boundary candidate selection (based on SIFT alignment)
\[
BC = \begin{cases} \text{Yes}, & \text{if } D(a, b) < c \cdot D(a, b') \\ \text{No}, & \text{otherwise} \end{cases}
\]
- where a is a SIFT feature in audio sequence A, b is the feature in B
closest to a, and b' is the feature in B second closest to a.
BC: boundary candidate
D(.): Euclidean distance
c: a constant (c = 0.7)
[Figure: auditory images A and B; yellow lines are boundary candidates]
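The selection rule is a nearest-neighbour ratio test and can be sketched as follows. The function name `boundary_candidates` is hypothetical, and the toy 2-D descriptors in the usage note stand in for real SIFT descriptors, which are 128-dimensional.

```python
import math
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def boundary_candidates(feats_a: Sequence[Descriptor], feats_b: Sequence[Descriptor],
                        c: float = 0.7) -> List[Tuple[int, int]]:
    """Return (index in A, index in B) pairs that pass D(a, b) < c * D(a, b')."""
    candidates: List[Tuple[int, int]] = []
    if len(feats_b) < 2:
        return candidates  # the test needs a closest and a second-closest feature
    for i, a in enumerate(feats_a):
        ranked = sorted(range(len(feats_b)), key=lambda j: math.dist(a, feats_b[j]))
        b, b_second = ranked[0], ranked[1]
        if math.dist(a, feats_b[b]) < c * math.dist(a, feats_b[b_second]):
            candidates.append((i, b))
    return candidates
```

For example, with `feats_a = [(0, 0), (5, 5)]` and `feats_b = [(0, 0.1), (10, 10), (5, 9), (5, 1)]`, the first feature has one clearly closest match and is kept, while the second has two equally distant neighbours and is rejected as ambiguous.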

Our Method
(2) Boundary candidate refinement
- A window distortion measure (WDM) is defined for refining each
boundary candidate.

Our Method
(3) Final boundary decision
- The alignment result is determined by the refined boundary
candidate with the minimum window distortion.

Demo 1
• “I’m Yours” by Jason Mraz, live in Singapore, 2012
– with context search (Aligned in 49.8001 s)
[Figure: timeline from 00:00:00 to 00:00:49.8001; Recording #4 and Recording #5, +0.4334 s]

Demo 2
• “All I Ask” by Adele, live at Birmingham Genting Arena, 2016
– with context search (Aligned in 53.2169 s)
[Figure: timeline from 00:00:00 to 00:00:53.2169; Recording #1 and Recording #2, +0.5502 s]

Demo - Multiple Audio Sequence Alignment Result
[Figure: timeline for Audience #1, Audience #2, and Audience #3; marked times 00:00:00, 00:00:52.4893, 04:00:2277, 00:00:52.7667, 03:58:8667]

Learning Professional Recording Skill
[Figure: Coherent-Net with shot type refinement (CRF); learned statistics: initial probability, duration (frames/shot), shot transition probability]
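The three statistics named on this slide can be estimated by simple counting once every frame of a professionally edited video has a shot type label. The following is a sketch under that assumption; `learn_shot_statistics` is a hypothetical helper name and the labels in the usage note are made up.

```python
from collections import Counter, defaultdict
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def learn_shot_statistics(videos: Sequence[Sequence[str]]) -> Tuple[
        Dict[str, float], Dict[str, float], Dict[str, Dict[str, float]]]:
    """Estimate initial probability, mean duration (frames/shot) and transition probability."""
    first_shots: Counter = Counter()
    durations: Dict[str, List[int]] = defaultdict(list)
    transitions: Dict[str, Counter] = defaultdict(Counter)
    for labels in videos:
        # Collapse runs of identical frame labels into (shot type, length) pairs
        shots = [(label, sum(1 for _ in run)) for label, run in groupby(labels)]
        if not shots:
            continue
        first_shots[shots[0][0]] += 1
        for label, length in shots:
            durations[label].append(length)
        for (prev, _), (nxt, _) in zip(shots, shots[1:]):
            transitions[prev][nxt] += 1
    n_videos = sum(first_shots.values())
    initial = {label: count / n_videos for label, count in first_shots.items()}
    mean_duration = {label: sum(v) / len(v) for label, v in durations.items()}
    transition = {prev: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
                  for prev, counts in transitions.items()}
    return initial, mean_duration, transition
```

Calling it on `[list("LLLMMLL"), list("MMCC")]` gives an initial probability of 0.5 for both L and M, a mean L-shot duration of 2.5 frames, and equal transition probabilities from M to L and from M to C.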


Demo - Mashup Result
[Figure: mashup results mr#1, mr#2, mr#3]

Spatio-Temporal Learning
of Basketball Offensive
Strategies

Motivations
• To develop an automatic tactics analysis
tool for coaches, players, and the general
public.
• To develop a new technique that can
compete with existing tools, such as
SportVU, at a much lower price.

Methodology Adopted
• To analyze group behavior directly from the
court-view of an NBA broadcast video
• Detect and track each offense player,
calculate their trajectories and map these
trajectories from court view to tactic board
for analysis

Motivation (3)
• Unknown Offense Video Clip
90% → Screen Cut
10% → Princeton

• 6 cameras above the court
• No close-up view
→ Unable to see the details of plays

SportVU videos Broadcast videos
Tracked data Tracked data
SportVU system Our tracking system
?

How to extract features from an offense video clip?
• Automatic player detection
• Automatic player tracking
• Map the extracted trajectories from the basketball
court to the tactic board

Step 2: Derive correct player trajectories on the
panorama court (3/3)

Step 3: Map trajectories from the panorama court to the
tactic board
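One common way to carry out such a mapping is a planar homography, since both the court and the tactic board are planes. The talk does not state which transform is used, so the sketch below is an assumption; `apply_homography` and `map_trajectory` are hypothetical helper names, and the 3x3 matrix H is assumed to have been estimated beforehand, for example from court landmarks.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]


def apply_homography(H: Matrix3, point: Point) -> Point:
    """Map one court point to the tactic board using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to return to 2-D coordinates


def map_trajectory(H: Matrix3, trajectory: Sequence[Point]) -> List[Point]:
    """Map every point of a player trajectory."""
    return [apply_homography(H, p) for p in trajectory]
```

With a hypothetical H that halves the scale and shifts the origin, `[[0.5, 0, 10], [0, 0.5, 20], [0, 0, 1]]`, the court point (100, 40) lands at (60, 40) on the board.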

What’s next?
– Tactics analysis based on the
spatiotemporal trajectories
of the 5 offense players

A Two-Stage Unsupervised Clustering
for Tactic Analysis
• Stage 1: unsupervised clustering of all available
tactics based on their mutual distances
• Stage 2: unsupervised clustering of the tactics
within each Stage 1 cluster (to
separate the role of each offense player)

What techniques are needed ?
• A spatiotemporal model that can describe the
group behavior of 5 offense players
• Automatic clustering of group behaviors
(screen-cut, Princeton, wing-wheel, etc)
• Representation of each group behavior
• An appropriate metric to calculate the distance
between two arbitrary tactics.

Trajectory Set Representation
S: the spatiotemporal matrix
P_ij = (x_ij, y_ij): 2-D coordinates of the j-th player in the i-th frame
V_j = [P_1j P_2j ... P_Lj]^T
S = [V_1 V_2 V_3 V_4 V_5 (V_6)]
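The representation can be sketched as a nested list with one row per frame and one column per player. `build_trajectory_matrix` is a hypothetical helper; it assumes every track covers the same L frames.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def build_trajectory_matrix(tracks: Sequence[Sequence[Point]]) -> List[List[Point]]:
    """Build S so that S[i][j] = P_ij, the position of player j in frame i.

    tracks[j] is the trajectory V_j of player j as a list of (x, y) points.
    """
    if not tracks:
        return []
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all player tracks must cover the same number of frames")
    return [[tracks[j][i] for j in range(len(tracks))] for i in range(length)]
```

Column j of the result is the trajectory V_j, so reordering the players only permutes the columns of S.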

Distance Measure of Trajectory Set
• Problems
• Different time durations between 2 clips
• Ordering of column vectors

Trajectory Set Distance Matrix
S1=[V1 V2 V3 V4 V5]
S2=[U1 U2 U3 U4 U5]
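The two problems named above, different clip durations and the arbitrary ordering of the column vectors, can both be handled in a simple way. The sketch below is one possible scheme and not necessarily the talk's metric: each trajectory is resampled to a fixed number of points (an assumption), a pairwise distance matrix between the V_i and U_j is built, and the set distance is the minimum total cost over all player-to-player assignments. All function names are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def resample(traj: Sequence[Point], n: int = 20) -> List[Point]:
    """Linearly interpolate a trajectory to exactly n points (n >= 2)."""
    if len(traj) == 1:
        return [traj[0]] * n
    out: List[Point] = []
    for i in range(n):
        pos = i * (len(traj) - 1) / (n - 1)  # fractional index into traj
        lo = int(pos)
        hi = min(lo + 1, len(traj) - 1)
        frac = pos - lo
        out.append((traj[lo][0] + frac * (traj[hi][0] - traj[lo][0]),
                    traj[lo][1] + frac * (traj[hi][1] - traj[lo][1])))
    return out


def trajectory_distance(v: Sequence[Point], u: Sequence[Point], n: int = 20) -> float:
    """Mean point-to-point Euclidean distance after resampling both trajectories."""
    return sum(math.dist(p, q) for p, q in zip(resample(v, n), resample(u, n))) / n


def trajectory_set_distance(s1: Sequence[Sequence[Point]],
                            s2: Sequence[Sequence[Point]], n: int = 20) -> float:
    """Minimum total distance over all assignments of trajectories in s1 to s2."""
    if len(s1) != len(s2):
        raise ValueError("both sets must contain the same number of trajectories")
    d = [[trajectory_distance(v, u, n) for u in s2] for v in s1]  # distance matrix
    return min(sum(d[i][perm[i]] for i in range(len(s1)))
               for perm in permutations(range(len(s2))))
```

With five players there are only 120 permutations, so the exhaustive search is cheap; a larger set would call for an assignment algorithm instead.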

Clustering by Dominant Set
M. Pavan and M. Pelillo, “Dominant sets and pairwise clustering,” IEEE TPAMI, 2007.
[Figure: clusters Tactic1, Tactic2, Tactic3]
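In the dominant-set framework, a cluster is found by running replicator dynamics on a pairwise similarity matrix and keeping the items whose weight stays above zero. The sketch below extracts one dominant set. The function name, the iteration limit, and the cutoff are illustrative assumptions, and the similarity matrix is assumed to be symmetric, non-negative, and zero on the diagonal (for instance derived from trajectory-set distances).

```python
from typing import List, Sequence


def dominant_set(similarity: Sequence[Sequence[float]], iters: int = 1000,
                 tol: float = 1e-8, cutoff: float = 1e-4) -> List[int]:
    """Return the indices of one dominant set found by replicator dynamics."""
    n = len(similarity)
    x = [1.0 / n] * n  # start from the barycentre of the simplex
    for _ in range(iters):
        ax = [sum(similarity[i][j] * x[j] for j in range(n)) for i in range(n)]
        xax = sum(x[i] * ax[i] for i in range(n))
        if xax == 0.0:
            break  # no similarity mass left to redistribute
        new_x = [x[i] * ax[i] / xax for i in range(n)]  # replicator update
        change = sum(abs(p - q) for p, q in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return [i for i in range(n) if x[i] > cutoff]
```

To obtain several clusters, the members of the extracted set are removed and the procedure is repeated on the remaining items, so the number of clusters does not have to be fixed in advance.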

Second-stage: how to model an offense strategy ?
• 8 different trajectory sets of right hawk, each consists of
5 trajectories generated by 5 offense players

Clustering by Trajectory Distance
• Based on the distances between trajectories, each
group of tactics can be separated into five groups of trajectories, each corresponding to
a role (an offense player)
Hawk
Wing
Wheel
Princeton

Temporal Alignment
For each role, we model the trajectory by its velocities along the x- and y-directions
and use dynamic time warping (DTW) to solve the alignment
problem.
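Dynamic time warping on velocity sequences can be sketched as follows. This is the textbook DTW recurrence; treating each velocity as a 2-D vector (rather than warping x and y separately) and using frame-to-frame differences as velocities are assumptions made for the sketch, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


def velocities(traj: Sequence[Vec]) -> List[Vec]:
    """Frame-to-frame displacement along x and y."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(traj, traj[1:])]


def dtw_distance(a: Sequence[Vec], b: Sequence[Vec]) -> float:
    """Accumulated cost of the best monotonic alignment between two sequences."""
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i items of a with the first j of b
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                                 cost[i - 1][j - 1])  # one-to-one step
    return cost[len(a)][len(b)]
```

Two plays that follow the same movement pattern at different speeds get a small DTW distance, because the warping path can stretch or compress time.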

Demo _ Classification
Hawk
template

Princeton
template

Wing wheel
template

Thank you very much for
listening
