Diffusion Video Autoencoders:
Toward Temporally Consistent Face Video Editing
via Disentangled Video Encoding
Gyeongman Kim1 Hajin Shim1 Hyunsu Kim2 Yunjey Choi2 Junho Kim2 Eunho Yang1,3
1KAIST 2NAVER AI Lab 3AITRICS
Machine Learning & Intelligence Laboratory
CVPR 2023
1 min Summary
[Figure: cropped frames x^(1), ..., x^(N) go through 1. Encode, 2. Edit, 3. Decode frame-by-frame; the edited frames are temporally inconsistent]
Problem: Temporal consistency
Previous methods
• Face video editing: the task of modifying certain attributes of a face in a video
• All previous methods use GANs to edit the face in each frame independently
→ Modifying attributes such as beards causes temporal inconsistency
[Figure: cropped frames are encoded into a shared identity feature plus per-frame features; after 1. Encode, 2. Edit (of the identity only), 3. Decode, the edited frames are temporally consistent]
Solution: decompose the video into a single identity feature plus per-frame features

Ours
• Diffusion Video Autoencoders
• Decompose a video into a single identity feature z_id and, for each frame n, a motion feature z_lnd^(n) and a background feature z_bg^(n)
• video → decomposed features (z_id, {z_lnd^(n)}_{n=1}^{N}, {z_bg^(n)}_{n=1}^{N}) → video
→ All frames can be edited consistently with a single modification of the identity feature
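The encode-edit-decode pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `encode`, `edit_identity`, and `decode` are hypothetical stand-ins for the actual networks, and averaging per-frame identity estimates into one representative code is only one plausible way to obtain the shared feature.

```python
import numpy as np

def edit_video(frames, encode, edit_identity, decode):
    """Sketch of encode-edit-decode video editing: the identity code is
    shared across frames, so a single edit applies to the whole video."""
    # Encode every frame into identity / motion / background codes.
    codes = [encode(f) for f in frames]
    # Collapse per-frame identity estimates into one representative code.
    z_id = np.mean([c["id"] for c in codes], axis=0)
    # Edit the single shared identity feature once.
    z_id_edit = edit_identity(z_id)
    # Decode each frame with the shared edited identity plus its own
    # time-variant motion and background features.
    return [decode(z_id_edit, c["motion"], c["background"]) for c in codes]
```

Because `z_id_edit` is computed once and reused for every frame, the edit cannot drift over time, which is the source of the temporal consistency.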
Paper Details
Method Overview: video autoencoding & editing pipeline
[Figure: n-th frame autoencoding and editing pipeline: each frame x_0^(n) is encoded into a semantic latent z_face^(n) and a noise map x_T^(n), the identity part is optionally edited, and the frame is decoded back]
• Design a diffusion video autoencoder: x_0^(n) → (z_face^(n), x_T^(n)) → x_0^(n)
• High-level semantic latent z_face^(n) (512-dim): consists of the representative identity feature z_id,rep and the per-frame motion feature z_lnd^(n)
• Noise map x_T^(n): encodes only the information left out by z_face^(n) (= background information)
• Since background information varies too much to be projected into a low-dimensional space, it is encoded in the noise map x_T^(n)
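The construction of the semantic latent can be sketched as below. The 50/50 dimension split between identity and motion is an assumption made for illustration; the slide only states that z_face is 512-dimensional and combines z_id,rep with z_lnd^(n).

```python
import numpy as np

def build_face_latents(id_feats, lnd_feats):
    """Pair one representative identity feature with each frame's motion
    (landmark) feature to form the per-frame semantic latents z_face^(n).

    id_feats:  per-frame identity features from a frozen identity encoder
    lnd_feats: per-frame motion features from a frozen landmark encoder
    """
    # One time-invariant identity code shared by every frame
    # (mean-pooling is an illustrative choice).
    z_id_rep = np.mean(id_feats, axis=0)
    # Each frame keeps its own time-variant motion code.
    return [np.concatenate([z_id_rep, z_lnd]) for z_lnd in lnd_feats]
```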
Method Overview: video autoencoding & editing pipeline
• Frozen pre-trained encoders are used for feature extraction
• For near-perfect reconstruction, use DDIM, whose forward and backward processes are deterministic
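A minimal sketch of the deterministic DDIM step the slide refers to. With a fixed noise prediction `eps`, one forward step followed by the matching backward step recovers the input exactly; in the real model `eps` is re-predicted by the conditioned U-Net at every level, so reconstruction is only near-exact.

```python
import numpy as np

def ddim_step(x_t, eps, a_t, a_next):
    """One deterministic DDIM step between cumulative-alpha levels
    a_t -> a_next (forward/inversion if a_next < a_t, backward otherwise)."""
    # Predicted clean image implied by the current sample and noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    # Re-noise the prediction deterministically to the target level.
    return np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
```

Running the same step with the alpha levels swapped inverts it, which is what lets the noise map x_T serve as a lossless latent for the background.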
Method Overview: training objective
[Figure: training diagram: a shared U-Net conditioned on z_face denoises two independently noised copies of the image; the two estimated x_0 are compared on the face mask m]
• ℒ_simple = 𝔼_{x₀~q(x₀), ε_t~𝒩(0,I), t} ‖ε_θ(x_t, t, z_face) − ε_t‖²
• The simple version of the DDPM loss
• ℒ_reg = 𝔼_{x₀~q(x₀), ε₁,ε₂~𝒩(0,I), t} ‖f_{θ,1} ⊙ m − f_{θ,2} ⊙ m‖²
(f_{θ,i}: the estimated x₀ under noise ε_i, m: face mask)
• For a clear decomposition between background and face information
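In plain numpy the two objectives look roughly like this sketch, where `x0_pred_1` and `x0_pred_2` stand for the two estimates f_{θ,1}, f_{θ,2} and `face_mask` for m; the real losses are computed on batches of images inside the training loop.

```python
import numpy as np

def simple_loss(eps_pred, eps):
    """DDPM simple loss: MSE between the predicted and the true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def reg_loss(x0_pred_1, x0_pred_2, face_mask):
    """Regularization loss: the two x0 estimates, obtained from two
    independent noise draws, must agree on the face region, which pushes
    face information into z_face rather than into the noise."""
    return float(np.mean(((x0_pred_1 - x0_pred_2) * face_mask) ** 2))
```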
Method Overview: training objective
• ℒ_simple encourages the useful information of the image to be well contained in the semantic latent z_face
• ℒ_reg reduces the effect of the noise x_T on the face region, so that z_face becomes responsible for the face features
Method Overview: video editing framework
[Figure: editing diagram: 1. conditional sampling with z_face to obtain predictions at each noise level; 2. conditional sampling with the trainable edited identity feature, optimized with a CLIP loss at each intermediate step]
• Classifier-based editing
• Train a linear classifier for each attribute of CelebA-HQ in the identity feature z_id space
• CLIP-based editing
• Minimize a CLIP loss between intermediate images, with a drastically reduced number of steps S (≪ T)
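The classifier-based edit can be sketched as follows: given a trained linear attribute classifier with weights w in the z_id space, moving along its unit normal is the direction that most changes the attribute score. The fixed step size here is a simplification; how the method actually chooses the edit magnitude is not stated on the slide.

```python
import numpy as np

def edit_identity(z_id, w, toward_positive=True, scale=1.0):
    """Shift z_id along the unit normal of a linear attribute classifier
    (score = w . z + b), toward the positive or negative attribute side."""
    direction = w / np.linalg.norm(w)
    step = scale if toward_positive else -scale
    return z_id + step * direction
```

The edited identity is then decoded together with each frame's untouched motion and background features.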
Experiment: Reconstruction
• Our diffusion video autoencoder with T = 100 shows the best reconstruction quality, and with only T = 20 it still outperforms e4e
[Table: reconstruction comparison; baselines include Latent Transformer and STIT]
Experiment: Temporal Consistency

[Figure: edited frame sequences from the Original video, Latent Transformer (LT), STIT, VideoEditGAN, and Ours]
• Only our diffusion video autoencoder successfully produces temporally consistent results
Experiment: Temporal Consistency
• We greatly improve global consistency (TG-ID)
• Values close to 1 indicate that the edited video is as temporally consistent as the original
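A global identity-consistency score of this kind can be sketched as below. This is a hypothetical re-implementation for illustration, not the official TG-ID metric code: it takes the mean pairwise cosine similarity of per-frame identity features and normalizes it by the same statistic on the original video, so a value near 1 means "as consistent as the original".

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_id_consistency(edited_ids, original_ids):
    """Mean pairwise identity similarity of the edited frames, normalized
    by the original video's own pairwise similarity."""
    def mean_pairwise(ids):
        n = len(ids)
        sims = [cos_sim(ids[i], ids[j]) for i in range(n) for j in range(i + 1, n)]
        return float(np.mean(sims))
    return mean_pairwise(edited_ids) / mean_pairwise(original_ids)
```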
Experiment: Editing Wild Face Videos
• Owing to the reconstruction ability of diffusion models, it becomes possible to edit wild videos that are difficult to invert with GAN-based methods
[Figure: wild-video editing results (Original, Latent Transformer, STIT, Ours) for the attributes “young”, “gender”, and “beard”]
Experiment: Decomposed Features Analysis
[Figure: input, sampling with random x_T, and identity / motion / background switches]
• Generated images with switched identity, motion, and background features confirm that the features are properly decomposed
Experiment: Ablation Study
[Figure: input, reconstruction, and sampling with random x_T, each with and without ℒ_reg]
• Without the regularization loss, the identity changes significantly according to the random noise
• We conclude that the regularization loss helps the model decompose features effectively
Conclusions
• Our contribution is four-fold:
• We devise diffusion video autoencoders that decompose a video into a single time-invariant identity feature and per-frame time-variant features for temporally consistent editing
• Based on this decomposed representation, face video editing can be conducted by editing only the single time-invariant identity feature and decoding it together with the remaining original features
• Owing to the near-perfect reconstruction ability of diffusion models, our framework can also edit exceptional cases, such as faces partially occluded by objects, in addition to usual cases
• Beyond the existing predefined-attribute editing method, we propose a text-based identity editing method based on a local directional CLIP loss applied to the intermediate products of diffusion video autoencoders
Thank you!
Any questions?