Diffusion Video Autoencoders:
Toward Temporally Consistent Face Video Editing
via Disentangled Video Encoding
Gyeongman Kim1 Hajin Shim1 Hyunsu Kim2 Yunjey Choi2 Junho Kim2 Eunho Yang1,3
1KAIST 2NAVER AI Lab 3AITRICS
Machine Learning & Intelligence Laboratory
CVPR 2023
1 min Summary
[Figure: cropped frames x^(1), ..., x^(N) go through 1. Encode, 2. Edit, 3. Decode frame-by-frame; the edited frames are temporally inconsistent]
Problem: Temporal consistency
Previous methods
• Face video editing: the task of modifying certain attributes of a face in a video
• All previous methods use GANs to edit the face in each frame independently
→ Modifying attributes such as beards causes temporal inconsistency
[Figure: cropped frames are encoded into a shared identity feature plus per-frame features; after 1. Encode, 2. Edit (of the identity only), 3. Decode, the edited frames are temporally consistent]
Solution: decompose the video into a single identity feature plus per-frame features

Ours
• Diffusion Video Autoencoders
• Decompose a video into a single identity feature z_id and, for each frame n, a motion feature z_lnd^(n) and a background feature z_bg^(n)
• video → decomposed features (z_id, {z_lnd^(n)}_{n=1}^{N}, {z_bg^(n)}_{n=1}^{N}) → video
→ All frames can be edited consistently with a single modification of the identity feature
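The encode-edit-decode pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `encode`, `edit_identity`, and `decode` are hypothetical stand-ins for the actual networks, and averaging per-frame identity estimates into one representative code is only one plausible way to obtain the shared feature.

```python
import numpy as np

def edit_video(frames, encode, edit_identity, decode):
    """Sketch of encode-edit-decode video editing: the identity code is
    shared across frames, so a single edit applies to the whole video."""
    # Encode every frame into identity / motion / background codes.
    codes = [encode(f) for f in frames]
    # Collapse per-frame identity estimates into one representative code.
    z_id = np.mean([c["id"] for c in codes], axis=0)
    # Edit the single shared identity feature once.
    z_id_edit = edit_identity(z_id)
    # Decode each frame with the shared edited identity plus its own
    # time-variant motion and background features.
    return [decode(z_id_edit, c["motion"], c["background"]) for c in codes]
```

Because `z_id_edit` is computed once and reused for every frame, the edit cannot drift over time, which is the source of the temporal consistency.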
Paper Details
Method Overview: video autoencoding & editing pipeline
[Figure: n-th frame autoencoding and editing pipeline: each frame x_0^(n) is encoded into a semantic latent z_face^(n) and a noise map x_T^(n), the identity part is optionally edited, and the frame is decoded back]
• Design a diffusion video autoencoder: x_0^(n) → (z_face^(n), x_T^(n)) → x_0^(n)
• High-level semantic latent z_face^(n) (512-dim): consists of the representative identity feature z_id,rep and the per-frame motion feature z_lnd^(n)
• Noise map x_T^(n): encodes only the information left out by z_face^(n) (= background information)
• Since background information varies too much to be projected into a low-dimensional space, it is encoded in the noise map x_T^(n)
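The construction of the semantic latent can be sketched as below. The 50/50 dimension split between identity and motion is an assumption made for illustration; the slide only states that z_face is 512-dimensional and combines z_id,rep with z_lnd^(n).

```python
import numpy as np

def build_face_latents(id_feats, lnd_feats):
    """Pair one representative identity feature with each frame's motion
    (landmark) feature to form the per-frame semantic latents z_face^(n).

    id_feats:  per-frame identity features from a frozen identity encoder
    lnd_feats: per-frame motion features from a frozen landmark encoder
    """
    # One time-invariant identity code shared by every frame
    # (mean-pooling is an illustrative choice).
    z_id_rep = np.mean(id_feats, axis=0)
    # Each frame keeps its own time-variant motion code.
    return [np.concatenate([z_id_rep, z_lnd]) for z_lnd in lnd_feats]
```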
Method Overview: video autoencoding & editing pipeline
• Frozen pre-trained encoders are used for feature extraction
• For near-perfect reconstruction, use DDIM, whose forward and backward processes are deterministic
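A minimal sketch of the deterministic DDIM step the slide refers to. With a fixed noise prediction `eps`, one forward step followed by the matching backward step recovers the input exactly; in the real model `eps` is re-predicted by the conditioned U-Net at every level, so reconstruction is only near-exact.

```python
import numpy as np

def ddim_step(x_t, eps, a_t, a_next):
    """One deterministic DDIM step between cumulative-alpha levels
    a_t -> a_next (forward/inversion if a_next < a_t, backward otherwise)."""
    # Predicted clean image implied by the current sample and noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    # Re-noise the prediction deterministically to the target level.
    return np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
```

Running the same step with the alpha levels swapped inverts it, which is what lets the noise map x_T serve as a lossless latent for the background.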
Method Overview: training objective
[Figure: training diagram: a shared U-Net conditioned on z_face denoises two independently noised copies of the image; the two estimated x_0 are compared on the face mask m]
• ℒ_simple = 𝔼_{x₀~q(x₀), ε_t~𝒩(0,I), t} ‖ε_θ(x_t, t, z_face) − ε_t‖²
• The simple version of the DDPM loss
• ℒ_reg = 𝔼_{x₀~q(x₀), ε₁,ε₂~𝒩(0,I), t} ‖f_{θ,1} ⊙ m − f_{θ,2} ⊙ m‖²
(f_{θ,i}: the estimated x₀ under noise ε_i, m: face mask)
• For a clear decomposition between background and face information
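In plain numpy the two objectives look roughly like this sketch, where `x0_pred_1` and `x0_pred_2` stand for the two estimates f_{θ,1}, f_{θ,2} and `face_mask` for m; the real losses are computed on batches of images inside the training loop.

```python
import numpy as np

def simple_loss(eps_pred, eps):
    """DDPM simple loss: MSE between the predicted and the true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def reg_loss(x0_pred_1, x0_pred_2, face_mask):
    """Regularization loss: the two x0 estimates, obtained from two
    independent noise draws, must agree on the face region, which pushes
    face information into z_face rather than into the noise."""
    return float(np.mean(((x0_pred_1 - x0_pred_2) * face_mask) ** 2))
```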
Method Overview: training objective
• ℒ_simple encourages the useful information of the image to be well contained in the semantic latent z_face
• ℒ_reg reduces the effect of the noise x_T on the face region, so that z_face becomes responsible for the face features
Method Overview: video editing framework
[Figure: editing diagram: 1. conditional sampling with z_face to obtain predictions at each noise level; 2. conditional sampling with the trainable edited identity feature, optimized with a CLIP loss at each intermediate step]
• Classifier-based editing
• Train a linear classifier for each attribute of CelebA-HQ in the identity feature z_id space
• CLIP-based editing
• Minimize a CLIP loss between intermediate images, with a drastically reduced number of steps S (≪ T)
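The classifier-based edit can be sketched as follows: given a trained linear attribute classifier with weights w in the z_id space, moving along its unit normal is the direction that most changes the attribute score. The fixed step size here is a simplification; how the method actually chooses the edit magnitude is not stated on the slide.

```python
import numpy as np

def edit_identity(z_id, w, toward_positive=True, scale=1.0):
    """Shift z_id along the unit normal of a linear attribute classifier
    (score = w . z + b), toward the positive or negative attribute side."""
    direction = w / np.linalg.norm(w)
    step = scale if toward_positive else -scale
    return z_id + step * direction
```

The edited identity is then decoded together with each frame's untouched motion and background features.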
Experiment: Reconstruction
• Our diffusion video autoencoder with T = 100 shows the best reconstruction quality, and with only T = 20 it still outperforms e4e
[Table: reconstruction comparison; baselines include Latent Transformer and STIT]
Experiment: Temporal Consistency

[Figure: edited frame sequences from the Original video, Latent Transformer (LT), STIT, VideoEditGAN, and Ours]
• Only our diffusion video autoencoder successfully produces temporally consistent results
Experiment: Temporal Consistency
• We greatly improve global consistency (TG-ID)
• Values close to 1 indicate that the edited video is as temporally consistent as the original
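A global identity-consistency score of this kind can be sketched as below. This is a hypothetical re-implementation for illustration, not the official TG-ID metric code: it takes the mean pairwise cosine similarity of per-frame identity features and normalizes it by the same statistic on the original video, so a value near 1 means "as consistent as the original".

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_id_consistency(edited_ids, original_ids):
    """Mean pairwise identity similarity of the edited frames, normalized
    by the original video's own pairwise similarity."""
    def mean_pairwise(ids):
        n = len(ids)
        sims = [cos_sim(ids[i], ids[j]) for i in range(n) for j in range(i + 1, n)]
        return float(np.mean(sims))
    return mean_pairwise(edited_ids) / mean_pairwise(original_ids)
```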
Experiment: Editing Wild Face Videos
• Owing to the reconstruction ability of diffusion models, it becomes possible to edit wild videos that are difficult to invert with GAN-based methods
[Figure: wild-video editing results (Original, Latent Transformer, STIT, Ours) for the attributes “young”, “gender”, and “beard”]
Experiment: Decomposed Features Analysis
[Figure: input, sampling with random x_T, and identity / motion / background switches]
• Generated images with switched identity, motion, and background features confirm that the features are properly decomposed
Experiment: Ablation Study
[Figure: input, reconstruction, and sampling with random x_T, each with and without ℒ_reg]
• Without the regularization loss, the identity changes significantly according to the random noise
• We conclude that the regularization loss helps the model decompose features effectively
Conclusions
• Our contribution is four-fold:
• We devise diffusion video autoencoders that decompose a video into a single time-invariant identity feature and per-frame time-variant features for temporally consistent editing
• Based on this decomposed representation, face video editing can be conducted by editing only the single time-invariant identity feature and decoding it together with the remaining original features
• Owing to the near-perfect reconstruction ability of diffusion models, our framework can also edit exceptional cases, such as faces partially occluded by objects, in addition to usual cases
• Beyond the existing predefined-attribute editing method, we propose a text-based identity editing method based on a local directional CLIP loss applied to the intermediate products of diffusion video autoencoders
Thank you!
Any questions?