2. Outline
Problem Formulation
Current Solutions
Our Goal
Gory Details
Performance Evaluation
What’s Next?
Questions and Discussion
3. Problem Formulation
Digital video capture devices such as DV camcorders have become more affordable for end users.
Shooting video is fun, but editing it is frustrating.
There is still a tremendous barrier between amateurs (home users) and powerful video editing software.
As a result, people leave their precious shots in piles of DV tapes, unedited and unmanaged.
4. According to a survey on DVworld*, the relation between video length and the number of times users review a video in the following days:
Video length      Review times
>= 1 hr           1 or 0
30 min ~ 1 hr     2 ~ 3
15 ~ 30 min       5 ~ 10
5 ~ 15 min        >= 10
<= 5 min          You take it out and watch it whenever you think of it!
Video clips of no more than 5 minutes are best for human concentration.
*http://www.DVworld.com.tw/
5. Facts about Musical Video
People are impatient with videos that have no scenario or voice-over, especially those with no music.
An improved soundtrack improves the perceived quality of the video image.
Synchronizing video and audio segments enhances the perception of both.
One study at MIT showed that listeners judge an identical video image to be of higher quality when it is accompanied by higher-fidelity audio.
6. Home videos can be roughly classified by their nature:
Causal: shots within the video are causal; changing the order of shots may confuse the viewer.
Non-causal: shots are not causal; it is OK to re-order the video shots.
Recreational: videos used to convey a kind of emotion or enjoyment.
Memorial: videos of events such as a wedding or graduation celebration; each shot should be preserved properly.
Four profiles are proposed to deal with videos of different natures.
7. Current Solutions
A consumer product called “muvee autoProducer” has
been announced to ease the burden of professional video
editing.
Its application scenario is quite simple:
Pick your video -> Choose your favorite music -> Select profiles to apply -> Produce a quality musical video
8. Our Goal
Although there are commercial products on the market, there are only a few related academic publications:
Jonathan Foote, Matthew D. Cooper, Andreas Girgensohn, "Creating music videos using automatic media analysis," ACM Multimedia 2002: 553-560.
Content-analysis technologies have been developed for years; can we adopt them to help create musical videos automatically?
Goal: to achieve comparable or better quality in a similar application scenario, using the content-analysis technologies developed in the multimedia domain.
9. Proposed Framework
[Figure: block diagram of the proposed framework. Inputs: input video, input music. Video side: shot change, scene change; features: human face, flashlight, motion strength, color variance, camera operation, ... Audio side: audio segment cutting; features: volume, ZCR, brightness, bandwidth, ... Alignment: scene selection, key shot selection, audio rhythm & video motion/color synchronization. Output: output video.]
10. Audio Analysis
The input audio is cut into several clips according to its audio features.
Frame-level features
Volume: defined as the RMS (root mean square) of the audio samples
ZCR: the number of times that the audio waveform crosses the
zero axis in each frame.
Spectral features
Brightness: the centroid of frequency spectrum
Bandwidth: the standard deviation of frequency spectrum
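The four features listed above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: it assumes volume is the root mean square of the samples in a frame, uses a naive DFT (fine for short frames only), and weights the spectrum by magnitude. All function names are hypothetical.

```python
import math

def volume(frame):
    """Root-mean-square amplitude of one frame (assumed definition of volume)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples in the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (quadratic cost, short frames only)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def brightness_and_bandwidth(frame, sample_rate):
    """Spectral centroid and magnitude-weighted standard deviation, both in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0, 0.0
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    spread = math.sqrt(sum((f - centroid) ** 2 * m for f, m in zip(freqs, mags)) / total)
    return centroid, spread
```

For a pure tone, the brightness lands on the tone's frequency and the bandwidth is close to zero, which matches the intuition behind the two spectral features.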
11. The brightness curve generally has almost the same shape as the ZCR curve, so we use only the ZCR feature here.
Bandwidth is an important audio feature, but it is hard to tell what its high or low values physically mean in music.
Furthermore, the relation between musical perception and bandwidth values is neither clear nor regular.
[Figure: feature curves for brightness, ZCR, volume, and bandwidth.]
12. Audio Segmentation
First, we cut the input audio into clips wherever the volume changes dramatically.
Within each clip, we define a burst of ZCR as an “attack”, which may be a beat of the bass drum or the singer’s voice.
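Clip boundary (volume cut), where $v_j$ is the volume of frame $j$ and $w$ is the window length:
$$A_{cut}(i) = \begin{cases} 1, & F_{cut}(i) > \mathit{CutV\_th} \\ 0, & \text{otherwise} \end{cases}$$
$$F_{cut}(i) = \mathrm{abs}\left( \frac{\sum_{j=i}^{i+w} v_j}{w} - \frac{\sum_{j=i-w+1}^{i} v_j}{w} \right)$$
$$\mathit{CutV\_th} = \max_i(v_i) / 10$$
Attack (ZCR burst), where $z_i$ is the ZCR of frame $i$:
$$A_{attack}(i) = \begin{cases} 1, & F_{attack}(i) > 2 \times \mathrm{std}(z_i) \\ 0, & \text{otherwise} \end{cases}$$
$$F_{attack}(i) = \mathrm{abs}\left( z_i - \frac{\sum_{j=i}^{i+w} v_j}{w} \right) + \mathrm{abs}\left( z_i - \frac{\sum_{j=i-w+1}^{i} v_j}{w} \right)$$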
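The two detectors on this slide (volume cut and ZCR attack) can be sketched as follows. This is a hedged illustration with two simplifications that are assumptions, not part of the slide: window sums are divided by the number of frames actually available (so the edges of the signal are handled), and the attack detector compares each ZCR value with the local averages of the ZCR sequence itself. The function names are hypothetical.

```python
import statistics

def window_mean(values, lo, hi):
    """Mean of values[lo..hi] (inclusive), clipped to the valid index range."""
    lo, hi = max(lo, 0), min(hi, len(values) - 1)
    return sum(values[lo:hi + 1]) / (hi - lo + 1)

def clip_boundaries(volume, w):
    """Flag frames where forward and backward mean volume differ by more than max/10."""
    threshold = max(volume) / 10
    flags = []
    for i in range(len(volume)):
        f_cut = abs(window_mean(volume, i, i + w) - window_mean(volume, i - w + 1, i))
        flags.append(1 if f_cut > threshold else 0)
    return flags

def attacks(zcr, w):
    """Flag frames whose ZCR stands out from both neighbouring windows (a burst)."""
    threshold = 2 * statistics.pstdev(zcr)
    flags = []
    for i, z in enumerate(zcr):
        f_attack = abs(z - window_mean(zcr, i, i + w)) + abs(z - window_mean(zcr, i - w + 1, i))
        flags.append(1 if f_attack > threshold else 0)
    return flags
```

A step in volume flags several frames around the step, so in practice consecutive flags would probably be merged into a single clip boundary.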
13. A dramatic volume change defines an audio clip boundary, while the bursts of ZCR (attacks) within each clip define the finer-grained sub-segments.
[Figure: example song, 孫燕姿 (Stefanie Sun), “綠光” (“Green Light”); labels: clip boundary, attacks as sub-clip separation.]
Here we define the dynamic of each clip as:
$$A_{dynamic}(i) = \frac{\sum_j z_j}{len(i)}$$
The dynamic feature serves as a good reference later for video/audio synchronization.
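As a small worked example of the clip dynamic, the sketch below sums the ZCR values of each clip and divides by the clip length. It assumes the clip length is measured in frames and that clips are given as half-open frame ranges; both are assumptions made for illustration, and `clip_dynamic` is a hypothetical name.

```python
def clip_dynamic(zcr_frames, clip_ranges):
    """Sum of ZCR over each clip divided by its length in frames.

    clip_ranges holds (start, end) frame indices, end exclusive (assumed layout).
    """
    return [sum(zcr_frames[start:end]) / (end - start) for start, end in clip_ranges]

# A calm clip followed by a busy one: the second gets the higher dynamic.
print(clip_dynamic([1, 1, 1, 9, 9, 9], [(0, 3), (3, 6)]))
```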
14. Video Analysis
First we apply shot change detection to segment the video into scenes.
We combine the pixel MAD (mean absolute difference) method and the pixel histogram method to perform shot change detection:
$$V_{shot}(i) = \begin{cases} 1, & S_{MAD}(i) = 1 \text{ and } S_{HIST}(i) = 1 \\ 0, & \text{otherwise} \end{cases}$$
Dcolor < Thcolor: nothing
Dcolor > Thcolor and Dhist < Thhist: unsuitable!
Dcolor > Thcolor and Dhist > Thhist: shot change!
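The decision rule above can be sketched on two consecutive grayscale frames. This is only an illustration under stated assumptions: frames are flat lists of 8-bit luminance values, Dcolor is taken to be the pixel MAD, Dhist is a normalized bin-wise histogram difference, and the thresholds and bin count are arbitrary. None of these names or values come from the slides.

```python
def mean_abs_diff(prev, curr):
    """Pixel-wise mean absolute difference between two frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def histogram(frame, bins=16, levels=256):
    """Count pixels per luminance bin."""
    counts = [0] * bins
    for p in frame:
        counts[p * bins // levels] += 1
    return counts

def histogram_diff(prev, curr, bins=16):
    """Sum of absolute bin differences, normalized by the pixel count."""
    hp, hc = histogram(prev, bins), histogram(curr, bins)
    return sum(abs(a - b) for a, b in zip(hp, hc)) / len(curr)

def classify_transition(prev, curr, th_color, th_hist):
    """Return 'shot change', 'unsuitable', or 'nothing' for a frame pair."""
    color_hit = mean_abs_diff(prev, curr) > th_color
    hist_hit = histogram_diff(prev, curr) > th_hist
    if color_hit and hist_hit:
        return "shot change"
    if color_hit:
        return "unsuitable"  # pixels moved a lot, but the overall histogram is stable
    return "nothing"
```

Requiring both tests to fire is what keeps shaky or fast-moving footage, where pixels differ but the histogram barely changes, from being reported as a shot change.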
15. Flashlight detection
A flashlight event would otherwise be detected as a shot change.
When a shot change is found, check whether:
$$\frac{Mean(L_i)}{Mean(L_{i-1})} \ge \mathit{Flash\_th} \;\;\&\&\;\; \frac{Mean(L_i)}{Mean(L_{i+1})} \ge \mathit{Flash\_th}$$
If so, it is a flashlight event and should not be treated as a shot change.
Sub-Shot segmentation
We use the MPEG-7 ColorLayout descriptor to measure the similarity between frames.
The first frame of each shot is selected as the basis, and each subsequent frame is compared with it. If
$$D(i) = \sum_{k=1}^{i} dist(F_k, F_0), \quad D(i) \ge \mathit{SubScene\_Th}$$
then we say that a sub-shot occurs at frame i.
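Both checks on this slide reduce to a few comparisons. The sketch below assumes the per-frame mean luminance and the per-frame distance to the first frame of the shot are already available (for example from a ColorLayout comparison, which is not implemented here). The function names and threshold values are hypothetical.

```python
def is_flash(mean_luma, i, flash_th):
    """True if frame i is brighter than both neighbours by a ratio of at least flash_th."""
    if i <= 0 or i >= len(mean_luma) - 1:
        return False  # a neighbour on each side is needed
    prev, nxt = mean_luma[i - 1], mean_luma[i + 1]
    if prev <= 0 or nxt <= 0:
        return False  # avoid dividing by zero on black frames
    return mean_luma[i] / prev >= flash_th and mean_luma[i] / nxt >= flash_th

def first_sub_shot(dist_to_first, sub_th):
    """Index of the first frame whose accumulated distance to frame 0 reaches sub_th.

    dist_to_first[k] is dist(F_k, F_0); index 0 is the basis frame itself.
    Returns None if the threshold is never reached.
    """
    accumulated = 0
    for i in range(1, len(dist_to_first)):
        accumulated += dist_to_first[i]
        if accumulated >= sub_th:
            return i
    return None
```

A single bright frame surrounded by normal ones passes the flash test, while a real cut to a brighter shot fails it because the following frame stays bright.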
16. Camera Operation
Camera operations such as pan and zoom are widely used in amateur home videos. Detecting them helps capture the video taker’s intention.
Our camera operation detection is based on the motion vectors in the P-frames of the MPEG video.
Pan:
$$1 \le \frac{\sum_i |v_i|}{\left| \sum_i v_i \right|} \le 3$$
Zoom:
$$\frac{\sum_i |v_i|}{\left| \sum_i v_i \right|} > 3$$
This method is simple and efficient, yet it does well at detecting camera operations.
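One way to read the pan/zoom test is as the ratio between the summed magnitudes of the motion vectors and the magnitude of their vector sum: when all vectors point the same way (pan) the ratio stays near 1, and when they point outward or inward (zoom) they cancel and the ratio grows. The sketch below follows that reading, which is an assumption about the slide's formula. Motion vectors are taken as `(dx, dy)` pairs already extracted from the P-frames, and the function name is hypothetical.

```python
import math

def classify_camera_operation(vectors):
    """Classify one frame's motion vectors as 'pan', 'zoom', or 'none'."""
    total_magnitude = sum(math.hypot(dx, dy) for dx, dy in vectors)
    if total_magnitude == 0:
        return "none"  # no motion at all
    sum_x = sum(dx for dx, _ in vectors)
    sum_y = sum(dy for _, dy in vectors)
    net_magnitude = math.hypot(sum_x, sum_y)
    # The ratio is at least 1; it is unbounded when the vectors cancel out.
    ratio = math.inf if net_magnitude == 0 else total_magnitude / net_magnitude
    return "pan" if ratio <= 3 else "zoom"

print(classify_camera_operation([(5, 0)] * 10))
print(classify_camera_operation([(1, 0), (-1, 0), (0, 1), (0, -1)]))
```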
17. Video Features
Frame-level features
The presence of human faces.
The OpenCV library is used as the face detection module.
Motion intensity
Flashlight detection
Mean and standard deviation of luminance plane
(Dcolor(i) > Thcolor && Dhist(i) < Thhist) defines the unsuitable frames
Shot-level features
Number and types of camera operations in each shot.
Number of faces and flashlight events in each shot.
The accumulated distance between each frame and the first frame can be used to describe the shot’s homogeneity.
18. Importance Measure
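Frame-level score function:
$$Score = \alpha \times (R_{face} + E_{flash}) + \beta \times (R_{motion} \times S_a + \mathit{Camera\_op}) + \gamma \times \left( \frac{Mean - 130 + Std}{256} \right)$$
$$R_{face} = \frac{Area_{face}}{W \times H}, \quad E_{flash} \in \{0, 1\}, \quad R_{motion} = \frac{Motion_i}{\max(Motion)}, \quad \mathit{Camera\_op} \in \{0, 1, 2\}$$
$$\alpha = 0.5, \quad \beta = 0.3, \quad \gamma = 0.2$$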
Face and flashlight events have the highest weighting.
Camera operations and higher motion intensity reflect the video taker’s intention, so they are more important.
Frames with higher luminance and larger standard deviation are more suitable.
The penalty for unsuitable frames will be discussed later.
S_a is a scaling coefficient determined by the features of the synchronized audio clip.
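A hedged sketch of the frame-level score follows. It uses the weights 0.5, 0.3, and 0.2 from the slide and one plausible reading of the luminance term, (Mean - 130 + Std) / 256; that grouping is an assumption, as are the parameter names and the default audio scaling of 1.0.

```python
ALPHA, BETA, GAMMA = 0.5, 0.3, 0.2  # weights given on the slide

def frame_score(face_area, width, height, flash, motion, max_motion,
                camera_op, luma_mean, luma_std, audio_scale=1.0):
    """Weighted sum of face/flash, motion/camera, and luminance terms.

    flash is 0 or 1, camera_op is 0, 1, or 2; audio_scale stands in for S_a.
    """
    r_face = face_area / (width * height)
    r_motion = motion / max_motion if max_motion else 0.0
    luma_term = (luma_mean - 130 + luma_std) / 256  # assumed grouping of the terms
    return (ALPHA * (r_face + flash)
            + BETA * (r_motion * audio_scale + camera_op)
            + GAMMA * luma_term)

# A frame with a face filling half the picture, a flash, and maximal motion.
print(frame_score(5000, 100, 100, 1, 1.0, 1.0, 0, 130, 0))
```

With this reading, a dark frame with no face, flash, motion, or camera operation gets a slightly negative score, so it naturally loses to almost any other frame.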
19. The shot-level importance is motivated by the following observations:
Shots with higher motion intensity deserve a longer duration.
The presence of faces attracts the viewer.
Shots of higher heterogeneity can take a longer playing time.
Shots with more camera operations are more important.
Of course, shots that are longer in the original are more important.
Shot-level importance:
$$IMP = Len \times \left( \frac{Num_{face}}{Len} + \frac{\sum \mathit{Camera\_op}}{Len} \right) \times \left( \frac{\sum Motion}{Len} \right) \times \left( \frac{\sum Diff}{Len} \right)$$
The shot-level importance function is used in the medium profile to reassign each shot’s length according to its importance.
Static shots become shorter, while dynamic shots can become longer.
This gives better results after editing.
“muvee autoProducer” does not reassign each shot’s length!
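The shot-level importance can be computed directly from per-shot totals. The sketch below assumes those totals (face count, summed camera operation values, summed motion, summed frame distance to the first frame) have already been collected; the parameter names are hypothetical.

```python
def shot_importance(length, num_faces, camera_ops, motion_sum, diff_sum):
    """Importance of one shot: length times three per-frame averages.

    length is the shot length in frames; the other arguments are totals
    accumulated over the shot.
    """
    if length <= 0:
        raise ValueError("shot length must be positive")
    face_and_camera = (num_faces + camera_ops) / length
    motion = motion_sum / length
    heterogeneity = diff_sum / length
    return length * face_and_camera * motion * heterogeneity

# Two shots of equal length: the livelier one gets the larger importance.
print(shot_importance(100, 10, 5, 200, 50))
print(shot_importance(100, 10, 5, 20, 50))
```

Because the terms are multiplied, a shot with no faces and no camera operations scores zero regardless of its motion, which is worth keeping in mind when interpreting the results.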
21. Proposed Profiles
The use of profiles lets users easily customize their videos according to the content’s properties and their own preferences.
As stated earlier, home videos fall into four types:
Causal, Non-causal, Recreational, Memorial
For causal or non-causal videos, we use the sequential or non-sequential parameter.
For memorial or recreational videos, the rhythmic or medium parameter is provided.
In rhythmic, the music tempo/rhythm is better preserved, but some video shots will be dropped.
In medium, the match with the music tempo/rhythm is not as clear as in rhythmic, but most of the shots are guaranteed to be shown. The medium parameter preserves the original video the most.
22. Thus we have four profiles:
Sequential Rhythmic: the time sequence of shots is preserved, with the rhythmic parameter.
Non-Sequential Rhythmic: the rhythmic parameter is applied, but the original order of shots is changed.
Sequential Medium: the time sequence of shots is preserved, with the medium parameter.
Non-Sequential Medium: the medium parameter is applied, but the original order of shots is changed.
23. Rhythmic vs. Medium
In rhythmic, the video is segmented according to the audio clips and sub-clips.
After projecting each audio segment onto the video timeline, we search the corresponding video range for the video segment that has the highest score and the same length as the audio segment.
Finally, all the selected segments are concatenated.
[Figure: audio track segments projected onto the video track.]
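The search step described on this slide can be sketched as a sliding-window maximum over per-frame scores. The sketch assumes the frame scores are already computed and that the projected video range and the audio segment length are given in frames; `best_window` is a hypothetical helper, not the actual implementation.

```python
def best_window(scores, lo, hi, length):
    """Start index of the highest-scoring run of `length` frames inside scores[lo:hi].

    Returns None when the range is shorter than the requested length.
    """
    if length <= 0 or length > hi - lo:
        return None
    current = sum(scores[lo:lo + length])
    best_start, best_sum = lo, current
    for start in range(lo + 1, hi - length + 1):
        # Slide the window one frame: add the new frame, drop the old one.
        current += scores[start + length - 1] - scores[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start

scores = [0, 1, 5, 6, 1, 0, 0, 9]
print(best_window(scores, 0, 8, 2))  # best two-frame run in the whole range
print(best_window(scores, 4, 8, 2))  # best two-frame run in the second half
```

Running this once per audio segment and concatenating the chosen windows gives the rhythmic result; the cost is linear in the number of frames searched.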
24. In medium, each shot is reassigned a new length according to its shot importance; shots may become longer or shorter in proportion to the total length.
After projection onto the video space, the length budget is calculated according to the reduction rate and then allocated to the shots within it according to their lengths.
If an allocated shot length is too short (< 30 frames), its budget is transferred to neighboring shots.
[Figure: video track shots re-scaled to fit the audio track.]
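The budget allocation can be sketched as a proportional split followed by one pass that hands too-short allocations to a neighbor. The proportional rule, the choice of the next shot as the receiving neighbor, and the single forward pass are all assumptions made for illustration; only the 30-frame minimum comes from the slide.

```python
def allocate_budget(weights, budget, min_frames=30):
    """Split `budget` frames across shots in proportion to `weights`.

    weights may be the original shot lengths or shot importances.
    An allocation below min_frames is moved to the next shot
    (or the previous one for the last shot) and the short shot is dropped.
    """
    total = sum(weights)
    alloc = [budget * w / total for w in weights]
    for i in range(len(alloc)):
        if 0 < alloc[i] < min_frames and len(alloc) > 1:
            j = i + 1 if i + 1 < len(alloc) else i - 1
            alloc[j] += alloc[i]
            alloc[i] = 0.0
    return alloc

# 600 original frames squeezed into a 300-frame budget.
print(allocate_budget([300, 40, 260], 300))
```

The total budget is preserved by construction; a fuller version would probably repeat the pass until no allocation falls below the minimum.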
25. However, there are some issues:
A fast tempo/rhythm audio clip may be aligned to a static video shot, which is annoying for the viewer.
A slow audio clip may be aligned to a dynamic video shot.
We therefore apply an audio scaling coefficient in the synchronization stage.
The weight of a video shot’s motion intensity is decreased when the shot is aligned with a slow audio clip, and nearly preserved when it is synchronized with a fast audio clip.
Another issue arises when the media lengths differ:
[Figure: video track and audio track of different lengths.]
This is unavoidable when the sequential policy is enforced.
26. For some video sources, the order of shots is not so important, and re-ordering the shots does not degrade the original.
If we allow the input video shots to be re-ordered, things may get better:
[Figure: video track shots permuted to match the audio track.]
It sounds simple and intuitive, but developing an efficient algorithm to find such a permutation is not an easy problem.
Furthermore, a single “best” solution may not exist; more than one permutation may be optimal.
27. Non-Sequential Permutation
So we developed a randomized algorithm to find a “not-bad” solution within a predictable computation time.
First, randomly permute the video shots.
Then, for each shot, compute Ravc, the “audio-to-video coverage”, on the corresponding timeline.
[Figure: video shots against audio clips, illustrating Ravc = 1, Ravc = 2, and Ravc = 3.]
Then we calculate the average Ravc; each permutation has its own average Ravc.
After many iterations, we keep the permutation with the minimal average Ravc. In theory, we can approach the optimal solution efficiently and predictably; how close we get depends only on how many iterations we perform.
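The randomized search can be sketched as follows. The definition of Ravc used here, the number of audio clips a shot overlaps once the shots are laid end to end on the timeline, is an assumption inferred from the figure; the function names, the fixed seed, and the boundary representation are hypothetical.

```python
import bisect
import random

def average_coverage(shot_lengths, audio_cuts):
    """Average number of audio clips each shot overlaps.

    audio_cuts is the sorted list of interior audio clip boundaries,
    in the same time unit as shot_lengths (all lengths positive).
    """
    start, total = 0, 0
    for length in shot_lengths:
        end = start + length
        # Count cuts strictly inside (start, end); each one adds another clip.
        inside = bisect.bisect_left(audio_cuts, end) - bisect.bisect_right(audio_cuts, start)
        total += 1 + inside
        start = end
    return total / len(shot_lengths)

def search_permutation(shot_lengths, audio_cuts, iterations=10000, seed=0):
    """Random search for the shot order with the smallest average coverage."""
    rng = random.Random(seed)
    order = list(range(len(shot_lengths)))
    best_order = order[:]
    best = average_coverage(shot_lengths, audio_cuts)
    for _ in range(iterations):
        rng.shuffle(order)
        value = average_coverage([shot_lengths[k] for k in order], audio_cuts)
        if value < best:
            best_order, best = order[:], value
    return best_order, best
```

The running time is fixed by the iteration count, which is what makes the computation time predictable even though the result varies from run to run when the seed changes.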
28. For example, 10000 iterations are performed:
Permutation                               Minimal Ravc
7 5 8 11 3 14 13 1 2 0 9 6 12 4 10        1.455571
11 14 2 10 1 3 9 6 4 0 12 13 7 8 5        1.482213
9 7 13 1 14 6 2 10 8 0 11 4 12 3 5        1.508536
7 3 5 11 12 8 0 13 1 2 14 10 6 4 9        1.425809
13 5 2 10 3 12 7 11 0 14 9 6 8 4 1        1.453530
More iterations can yield a better solution, but our experiments show that 10000 iterations are quite enough and are no burden on our computation power (it is actually really fast).
Because of the randomness, each synchronization result is different. But as discussed before, it is normal to have many solutions.
30. Performance Evaluation
Development environment:
AMD Duron 1.2 GHz with 386 MB RAM
Analysis complexity:
For video, about 1.2 to 1.3 times the original video duration.
For audio, about 2 minutes for a 5-minute clip; if spectral analysis is performed, 4 to 5 minutes are needed.
The audio/video analysis results are saved as description files, so the analysis is required only once.
The synchronization can be regarded as O(n) complexity.
Analysis usually requires less than 20 MB of RAM (depending on the number of shots in the video).
The synchronization result is saved as an AviSynth script. We then use VirtualDub to encode the produced musical video.
32. What’s Next?
How should the evaluation experiment be designed?
The subjective test should not over-burden the viewer.
Should we add shot transition effects, such as dissolve, fade in, and fade out?
I have tried, but it is not as easy as I thought.
The automatic approach may not always produce a satisfying result; the experience is highly subjective and differs from person to person.
A semi-automatic approach is probably the best compromise: the automatic result serves only as a pre-processing basis and a labor-saving tool.
But a video editing tool is hard to develop, and I doubt whether it is necessary to develop one from scratch for the purpose of a thesis.
33. Questions and Discussion
Any comments are welcome.
Acknowledgment:
Special thanks to Mr. 劉嘉倫 for his videos and suggestions.
Thanks to the friends at DVworld who provided lots of ideas and comments.
Thanks to Chih-Hao Shen for his dancing video.